Part I Number Representation


1 Part I Number Representation
[Outline graphic listing the book's parts and chapters, ending with 28. Reconfigurable Arithmetic and the Appendix: Past, Present, and Future] Apr. 2012 Computer Arithmetic, Number Representation

2 About This Presentation
This presentation is intended to support the use of the textbook Computer Arithmetic: Algorithms and Hardware Designs (Oxford U. Press, 2nd ed., 2010). It is updated regularly by the author as part of his teaching of the graduate course ECE 252B, Computer Arithmetic, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Unauthorized uses are strictly prohibited. © Behrooz Parhami
Edition   Released    Revised
First     Jan. 2000   Sep. 2001, Sep. 2003, Sep. 2005, Apr. 2007, Apr. 2008, Apr. 2009
Second    Apr. 2010   Mar. 2011, Apr. 2012
Apr. 2012 Computer Arithmetic, Number Representation

3 I Background and Motivation
Number representation is arguably the most important topic: it affects system compatibility and ease of arithmetic. Covered here: 2’s-complement, redundant, and residue number systems, and the limits of fast arithmetic. Floating-point numbers are covered in Chapter 17. Topics in This Part: Chapter 1 Numbers and Arithmetic; Chapter 2 Representing Signed Numbers; Chapter 3 Redundant Number Systems; Chapter 4 Residue Number Systems Apr. 2012 Computer Arithmetic, Number Representation

4 “This can’t be right . . . It goes into the red!”
Apr. 2012 Computer Arithmetic, Number Representation

5 1 Numbers and Arithmetic
Chapter Goals Define scope and provide motivation Set the framework for the rest of the book Review positional fixed-point numbers Chapter Highlights What goes on inside your calculator? Ways of encoding numbers in k bits Radices and digit sets: conventional, exotic Conversion from one system to another Dot notation: a useful visualization tool Apr. 2012 Computer Arithmetic, Number Representation

6 Numbers and Arithmetic: Topics
Topics in This Chapter 1.1 What is Computer Arithmetic? 1.2 Motivating Examples 1.3 Numbers and Their Encodings 1.4 Fixed-Radix Positional Number Systems 1.5 Number Radix Conversion 1.6 Classes of Number Representations Apr. 2012 Computer Arithmetic, Number Representation

7 1.1 What is Computer Arithmetic?
Pentium Division Bug (1994-95): Pentium’s radix-4 SRT division algorithm occasionally gave an incorrect quotient First noted in 1994 by Tom Nicely, who computed sums of reciprocals of twin primes: 1/5 + 1/7 + 1/11 + 1/13 + ... + 1/p + 1/(p + 2) Worst-case example of division error in Pentium: 4,195,835 / 3,145,727 returned 1.33373906... instead of the correct 1.33382044... Apr. 2012 Computer Arithmetic, Number Representation

8 Top Ten Intel Slogans for the Pentium
Humor, circa 1995 (in the wake of the floating-point division bug)
It’s a FLAW, dammit, not a bug
It’s close enough, we say so
Nearly 300 correct opcodes
You don’t need to know what’s inside
Redefining the PC –– and math as well
We fixed it, really
Division considered harmful
Why do you think it’s called “floating” point?
We’re looking for a few good flaws
The errata inside
Apr. 2012 Computer Arithmetic, Number Representation

9 Aspects of, and Topics in, Computer Arithmetic
Hardware (our focus in this book): Design of efficient digital circuits for primitive and other arithmetic operations such as +, –, ×, ÷, √, log, sin, and cos. Issues: algorithms, error analysis, speed/cost trade-offs, hardware implementation, testing, verification.
Software: Numerical methods for solving systems of linear equations, partial differential equations, and so on. Issues: algorithms, error analysis, computational complexity, programming, testing, verification.
General-purpose hardware: flexible data paths; fast primitive operations like +, –, ×, ÷, √; benchmarking.
Special-purpose hardware: tailored to application areas such as digital filtering, image processing, radar tracking.
Fig. 1.1 The scope of computer arithmetic. Apr. 2012 Computer Arithmetic, Number Representation

10 Computer Arithmetic, Number Representation
1.2 A Motivating Example Using a calculator with √, x², and x^y functions, compute: u = √√ ... √2 (10 successive square roots) = “1024th root of 2” v = 2^(1/1024) Save u and v; if you can’t save, recompute values when needed x = (((u²)²)...)² (10 squarings) x' = u^1024 y = (((v²)²)...)² y' = v^1024 Perhaps v and u are not really the same value w = v – u = 1 × 10^–11 Nonzero due to hidden digits: (u – 1) and (v – 1), suitably scaled, expose digits beyond the displayed ones Apr. 2012 Computer Arithmetic, Number Representation

11 Finite Precision Can Lead to Disaster
Example: Failure of Patriot Missile (1991 Feb. 25) American Patriot Missile battery in Dharan, Saudi Arabia, failed to intercept an incoming Iraqi Scud missile The Scud struck an American Army barracks, killing 28 Cause, per GAO/IMTEC report: “software problem” (inaccurate calculation of the time since boot) Problem specifics: Time in tenths of second, as measured by the system’s internal clock, was multiplied by 1/10 to get the time in seconds Internal registers were 24 bits wide 1/10 = (0.0001 1001 1001 1001 1001 1001 1001 ...)two, chopped to 24 b Error ≈ 0.1100 1100 × 2^–23 ≈ 9.5 × 10^–8 Error in 100-hr operation period ≈ 9.5 × 10^–8 × 100 × 60 × 60 × 10 = 0.34 s Distance traveled by Scud = (0.34 s) × (1676 m/s) ≈ 570 m Apr. 2012 Computer Arithmetic, Number Representation

12 Inadequate Range Can Lead to Disaster
Example: Explosion of Ariane Rocket (1996 June 4) Unmanned Ariane 5 rocket of the European Space Agency veered off its flight path, broke up, and exploded only 30 s after lift-off (altitude of 3700 m) The $500 million rocket (with cargo) was on its first voyage after a decade of development costing $7 billion Cause: “software error in the inertial reference system” Problem specifics: A 64-bit floating-point number relating to the horizontal velocity of the rocket was being converted to a 16-bit signed integer An SRI* software exception arose during conversion because the 64-bit floating-point number had a value greater than what could be represented by a 16-bit signed integer (max 32,767) *SRI = Système de Référence Inertielle or Inertial Reference System Apr. 2012 Computer Arithmetic, Number Representation

13 1.3 Numbers and Their Encodings
Some 4-bit number representation formats Exponent in {-2, -1, 0, 1} Significand in {0, 1, 2, 3} Base-2 logarithm Apr. 2012 Computer Arithmetic, Number Representation

14 Encoding Numbers in 4 Bits
Fig Some of the possible ways of assigning 16 distinct codes to represent numbers. Small triangles denote the radix point locations. Apr. 2012 Computer Arithmetic, Number Representation

15 1.4 Fixed-Radix Positional Number Systems
( xk–1 xk–2 ... x1 x0 . x–1 x–2 ... x–l )r = Σ xi r^i, the sum taken over i = –l to k – 1 One can generalize to: Arbitrary radix (not necessarily integer, positive, constant) Arbitrary digit set, usually {–a, –a+1, ..., b–1, b} = [–a, b] Example 1.1. Balanced ternary number system: Radix r = 3, digit set = [–1, 1] Example 1.2. Negative-radix number systems: Radix –r, r ≥ 2, digit set = [0, r – 1] The special case with radix –2 and digit set [0, 1] is known as the negabinary number system Apr. 2012 Computer Arithmetic, Number Representation

16 More Examples of Number Systems
Example 1.3. Digit set [–4, 5] for r = 10: (3 –1 5)ten represents 300 – 10 + 5 = 295 Example 1.4. Digit set [–7, 7] for r = 10: (3 –1 5)ten = (3 0 –5)ten = (1 –7 0 –5)ten Example 1.7. Quater-imaginary number system: radix r = 2j, digit set [0, 3] Apr. 2012 Computer Arithmetic, Number Representation
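As a quick illustration of the generalized formula (a sketch of ours, not from the book; the function name eval_positional is made up), the examples above can be checked mechanically in Python:

    # Evaluate (x_{k-1} ... x_1 x_0)_r = sum of x_i * r^i for an arbitrary
    # radix and digit set; digits are listed most significant first.
    def eval_positional(digits, radix):
        k = len(digits)
        return sum(d * radix**(k - 1 - i) for i, d in enumerate(digits))

    # Example 1.1, balanced ternary: (1 -1 0)three = 9 - 3 = 6
    assert eval_positional([1, -1, 0], 3) == 6
    # Example 1.2, negabinary (radix -2): (1 1 0) = 4 - 2 = 2
    assert eval_positional([1, 1, 0], -2) == 2
    # Example 1.3, digit set [-4, 5] in radix 10: (3 -1 5)ten = 295
    assert eval_positional([3, -1, 5], 10) == 295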

17 1.5 Number Radix Conversion
u = w.v (whole part w, fractional part v) = ( xk–1 xk–2 ... x1 x0 . x–1 x–2 ... x–l )r in the old radix = ( XK–1 XK–2 ... X1 X0 . X–1 X–2 ... X–L )R in the new radix Example: (31)eight = (25)ten, i.e., 31 Oct. = 25 Dec.: Halloween = Xmas Radix conversion using arithmetic in the old radix r: convenient when converting from r = 10 Radix conversion using arithmetic in the new radix R: convenient when converting to R = 10 Apr. 2012 Computer Arithmetic, Number Representation

18 Radix Conversion: Old-Radix Arithmetic
Converting whole part w: (105)ten = (?)five Repeatedly divide by five
Quotient   Remainder
105
21         0
4          1
0          4
Therefore, (105)ten = (410)five
Converting fractional part v: (105.486)ten = (410.?)five Repeatedly multiply by five
Whole Part   Fraction
             .486
2            .430
2            .150
0            .750
3            .750
3            .750
Therefore, (105.486)ten ≈ (410.22033)five
Apr. 2012 Computer Arithmetic, Number Representation
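The two procedures just shown (repeated division for the whole part, repeated multiplication for the fraction) can be sketched in Python; this is our own illustrative code, not the book's:

    def whole_to_radix(w, R):
        digits = []                       # repeated division by the new radix R
        while w > 0:
            w, rem = divmod(w, R)
            digits.append(rem)            # remainders are the digits, LSD first
        return digits[::-1] or [0]

    def frac_to_radix(v, R, places):
        digits = []                       # repeated multiplication by R
        for _ in range(places):
            v *= R
            d = int(v)                    # whole part becomes the next digit
            digits.append(d)
            v -= d
        return digits

    print(whole_to_radix(105, 5))         # [4, 1, 0]  ->  (410)five
    print(frac_to_radix(0.486, 5, 5))     # [2, 2, 0, 3, 3]  ->  (0.22033)five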

19 Radix Conversion: New-Radix Arithmetic
Converting whole part w: (22033)five = (?)ten
((((2 × 5) + 2) × 5 + 0) × 5 + 3) × 5 + 3, evaluated left to right: 2 → 12 → 60 → 303 → 1518 (Horner’s rule)
Converting fractional part v: (410.22033)five = (105.?)ten
(0.22033)five × 5^5 = (22033)five = (1518)ten
1518 / 5^5 = 1518 / 3125 = 0.48576
Therefore, (410.22033)five = (105.48576)ten
Horner’s rule is also applicable: Proceed from right to left and use division instead of multiplication Apr. 2012 Computer Arithmetic, Number Representation

20 Horner’s Rule for Fractions
Converting fractional part v: (0.22033)five = (?)ten
(((((3 / 5) + 3) / 5 + 0) / 5 + 2) / 5 + 2) / 5, evaluated right to left: 0.6 → 0.72 → 0.144 → 2.144 → 2.4288 → 0.48576 (Horner’s rule)
Fig. Horner’s rule used to convert (0.22033)five to decimal. Apr. 2012 Computer Arithmetic, Number Representation
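Both directions of Horner's rule condense into a few lines of Python (our sketch; digit lists are most significant first):

    def horner_whole(digits, r):
        value = 0
        for d in digits:                  # left to right: multiply, then add
            value = value * r + d
        return value

    def horner_frac(digits, r):
        value = 0.0
        for d in reversed(digits):        # right to left: add, then divide
            value = (value + d) / r
        return value

    assert horner_whole([2, 2, 0, 3, 3], 5) == 1518
    print(horner_frac([2, 2, 0, 3, 3], 5))    # 0.48576, up to float rounding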

21 1.6 Classes of Number Representations
Integers (fixed-point), unsigned: Chapter 1 Integers (fixed-point), signed Signed-magnitude, biased, complement: Chapter 2 Signed-digit, including carry/borrow-save: Chapter 3 (but the key point of Chapter 3 is using redundancy for faster arithmetic, not how to represent signed values) Residue number system: Chapter 4 (again, the key to Chapter 4 is use of parallelism for faster arithmetic, not the representation itself) For the most part you need: 2’s-complement numbers, carry-save representation, and the IEEE floating-point format However, knowing the rest of the material (including RNS) provides you with more options when designing custom and special-purpose hardware systems Real numbers, floating-point: Chapter 17 (Part V deals with real arithmetic) Real numbers, exact: Chapter 20 Continued-fraction, slash, . . . Apr. 2012 Computer Arithmetic, Number Representation

22 Dot Notation: A Useful Visualization Tool
+ (a) Addition (b) Multiplication Fig Dot notation to depict number representation formats and arithmetic algorithms. Apr. 2012 Computer Arithmetic, Number Representation

23 2 Representing Signed Numbers
Chapter Goals Learn different encodings of the sign info Discuss implications for arithmetic design Chapter Highlights Using sign bit, biasing, complementation Properties of 2’s-complement numbers Signed vs unsigned arithmetic Signed numbers, positions, or digits Extended dot notation: posibits and negabits Apr. 2012 Computer Arithmetic, Number Representation

24 Representing Signed Numbers: Topics
Topics in This Chapter 2.1 Signed-Magnitude Representation 2.2 Biased Representations 2.3 Complement Representations 2.4 2’s- and 1’s-Complement Numbers 2.5 Direct and Indirect Signed Arithmetic 2.6 Using Signed Positions or Signed Digits Apr. 2012 Computer Arithmetic, Number Representation

25 2.1 Signed-Magnitude Representation
Fig A 4-bit signed-magnitude number representation system for integers. Apr. 2012 Computer Arithmetic, Number Representation

26 Signed-Magnitude Adder
Fig Adding signed-magnitude numbers using precomplementation and postcomplementation. Apr. 2012 Computer Arithmetic, Number Representation

27 2.2 Biased Representations
Fig A 4-bit biased integer number representation system with a bias of 8. Apr. 2012 Computer Arithmetic, Number Representation

28 Arithmetic with Biased Numbers
Addition/subtraction of biased numbers x + y + bias = (x + bias) + (y + bias) – bias x – y + bias = (x + bias) – (y + bias) + bias A power-of-2 (or 2^a – 1) bias simplifies addition/subtraction Comparison of biased numbers: Compare like ordinary unsigned numbers; find the true difference by ordinary subtraction We seldom perform arbitrary arithmetic on biased numbers Main application: Exponent field of floating-point numbers Apr. 2012 Computer Arithmetic, Number Representation
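A minimal sketch of the two identities above (ours; BIAS and the function names are made up), with codes treated as plain unsigned integers:

    BIAS = 8                                  # excess-8, as in the 4-bit example

    def biased_add(xb, yb):
        return xb + yb - BIAS                 # (x + bias) + (y + bias) - bias

    def biased_sub(xb, yb):
        return xb - yb + BIAS                 # (x + bias) - (y + bias) + bias

    # -3 and +5 carry the excess-8 codes 5 and 13; their sum, +2, is code 10
    assert biased_add(5, 13) == 10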

29 2.3 Complement Representations
Fig Complement representation of signed integers. Apr. 2012 Computer Arithmetic, Number Representation

30 Arithmetic with Complement Representations
Table 2.1 Addition in a complement number system with complementation constant M and range [–N, +P]
Desired operation | Computation to be performed mod M | Correct result with no overflow | Overflow condition
(+x) + (+y) | x + y | x + y | x + y > P
(+x) + (–y) | x + (M – y) | x – y if y ≤ x; M – (y – x) if y > x | N/A
(–x) + (+y) | (M – x) + y | y – x if x ≤ y; M – (x – y) if x > y | N/A
(–x) + (–y) | (M – x) + (M – y) | M – (x + y) | x + y > N
Apr. 2012 Computer Arithmetic, Number Representation

31 Example and Two Special Cases
Example -- complement system for fixed-point numbers: Complementation constant M = 12.000 Fixed-point number range [–6.000, +5.999] Represent –3.258 as 12.000 – 3.258 = 8.742 Auxiliary operations for complement representations: complementation or change of sign (computing M – x) and computation of residues mod M Thus, M must be selected to simplify these operations Two choices allow just this for fixed-point radix-r arithmetic with k whole digits and l fractional digits Radix complement: M = r^k Digit complement: M = r^k – ulp (aka diminished radix complement) ulp (unit in least position) stands for r^–l Allows us to forget about l, even for nonintegers Apr. 2012 Computer Arithmetic, Number Representation

32 2.4 2’s- and 1’s-Complement Numbers
Two’s complement = radix complement system for r = 2 M = 2^k 2^k – x = [(2^k – ulp) – x] + ulp = x^compl + ulp (complement all bits, then add ulp) Range of representable numbers with k whole bits: from –2^(k–1) to 2^(k–1) – ulp Fig. A 4-bit 2’s-complement number representation system for integers. Apr. 2012 Computer Arithmetic, Number Representation

33 1’s-Complement Number Representation
One’s complement = digit complement (diminished radix complement) system for r = 2 M = 2^k – ulp (2^k – ulp) – x = x^compl (complement all bits) Range of representable numbers with k whole bits: from –2^(k–1) + ulp to 2^(k–1) – ulp Fig. A 4-bit 1’s-complement number representation system for integers. Apr. 2012 Computer Arithmetic, Number Representation

34 Some Details for 2’s- and 1’s Complement
Range/precision extension for 2’s-complement numbers: . . . xk–1 xk–1 xk–1 xk–1 xk–2 . . . x1 x0 . x–1 x–2 . . . x–l (sign extension: replicate the sign bit to the left; extend the fraction with 0s) Range/precision extension for 1’s-complement numbers: . . . xk–1 xk–1 xk–1 xk–1 xk–2 . . . x1 x0 . x–1 x–2 . . . x–l xk–1 xk–1 xk–1 . . . (replicate the sign bit at both ends) Mod-2^k operation needed in 2’s-complement arithmetic is trivial: Simply drop the carry-out (subtract 2^k if result is 2^k or greater) Mod-(2^k – ulp) operation needed in 1’s-complement arithmetic is done via end-around carry: (x + y) – (2^k – ulp) = (x + y – 2^k) + ulp Connect cout to cin Apr. 2012 Computer Arithmetic, Number Representation
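The two mod-M reductions can be contrasted in a short Python sketch (ours): dropping the carry-out for 2's complement versus the end-around carry for 1's complement.

    K = 4                                     # word width in bits

    def add_2s_complement(x, y):
        return (x + y) & (2**K - 1)           # drop the carry-out (mod 2^k)

    def add_1s_complement(x, y):
        s = x + y
        if s >= 2**K:                         # carry-out: bring it back in
            s = (s & (2**K - 1)) + 1          # end-around carry
        return s

    # -3 + 5 = 2: -3 is 1101 in 2's complement and 1100 in 1's complement
    assert add_2s_complement(0b1101, 0b0101) == 0b0010
    assert add_1s_complement(0b1100, 0b0101) == 0b0010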

35 Which Complement System Is Better?
Table 2.2 Comparing radix- and digit-complement number representation systems
Feature/Property | Radix complement | Digit complement
Symmetry (P = N?) | Possible for odd r | Possible for even r (radices of practical interest are even)
Unique zero? | Yes | No, there are two 0s
Complementation | Complement all digits and add ulp | Complement all digits
Mod-M addition | Drop the carry-out | End-around carry
Apr. 2012 Computer Arithmetic, Number Representation

36 Why 2’s-Complement Is the Universal Choice
Can replace this mux with k XOR gates Fig Adder/subtractor architecture for 2’s-complement numbers. Apr. 2012 Computer Arithmetic, Number Representation

37 Signed-Magnitude vs 2’s-Complement
Signed-magnitude adder/subtractor is significantly more complex than a simple adder Fig. 2.7 Fig. 2.2 2’s-complement adder/subtractor needs very little hardware other than a simple adder Apr. 2012 Computer Arithmetic, Number Representation

38 2.5 Direct and Indirect Signed Arithmetic
Fig Direct versus indirect operation on signed numbers. Direct signed arithmetic is usually faster (not always) Indirect signed arithmetic can be simpler (not always); allows sharing of signed/unsigned hardware when both operation types are needed Apr. 2012 Computer Arithmetic, Number Representation

39 2.6 Using Signed Positions or Signed Digits
A key property of 2’s-complement numbers that facilitates direct signed arithmetic: the most significant bit carries negative weight x = (1 0 1 0 0 1 1 0)two’s-compl = –2^7 + 2^5 + 2^2 + 2^1 = –128 + 38 = –90 Check: –x = (0 1 0 1 1 0 1 0)two = 90 Fig. Interpreting a 2’s-complement number as having a negatively weighted most-significant digit. Apr. 2012 Computer Arithmetic, Number Representation

40 Associating a Sign with Each Digit
Signed-digit representation: Digit set [-a, b] instead of [0, r – 1] Example: Radix-4 representation with digit set [-1, 2] rather than [0, 3] Fig Converting a standard radix-4 integer to a radix-4 integer with the nonstandard digit set [–1, 2]. Apr. 2012 Computer Arithmetic, Number Representation

41 Redundant Signed-Digit Representations
Signed-digit representation: Digit set [-a, b], with redundancy index ρ = a + b + 1 – r > 0 Example: Radix-4 representation with digit set [-2, 2] Fig. Converting a standard radix-4 integer to a radix-4 integer with the nonstandard digit set [–2, 2]. Here, the transfer does not propagate, so conversion is “carry-free” Apr. 2012 Computer Arithmetic, Number Representation

42 Extended Dot Notation: Posibits and Negabits
Posibit, or simply bit: positively weighted Negabit: negatively weighted Unsigned positive-radix number 2’s-complement number Negative-radix number Fig Extended dot notation depicting various number representation formats. Apr. 2012 Computer Arithmetic, Number Representation

43 Extended Dot Notation in Use
+ (a) Addition (b) Multiplication Fig Example arithmetic algorithms represented in extended dot notation. Apr. 2012 Computer Arithmetic, Number Representation

44 3 Redundant Number Systems
Chapter Goals Explore the advantages and drawbacks of using more than r digit values in radix r Chapter Highlights Redundancy eliminates long carry chains Redundancy takes many forms: trade-offs Redundant/nonredundant conversions Redundancy used for end values too? Extended dot notation with redundancy Apr. 2012 Computer Arithmetic, Number Representation

45 Redundant Number Systems: Topics
Topics in This Chapter 3.1 Coping with the Carry Problem 3.2 Redundancy in Computer Arithmetic 3.3 Digit Sets and Digit-Set Conversions 3.4 Generalized Signed-Digit Numbers 3.5 Carry-Free Addition Algorithms 3.6 Conversions and Support Functions Apr. 2012 Computer Arithmetic, Number Representation

46 3.1 Coping with the Carry Problem
Ways of dealing with the carry propagation problem: 1. Limit propagation to within a small number of bits (Chapters 3-4) 2. Detect end of propagation; don’t wait for worst case (Chapter 5) 3. Speed up propagation via lookahead etc. (Chapters 6-7) 4. Ideal: Eliminate carry propagation altogether! (Chapter 3) Operand digits in [0, 9] yield position sums in [0, 18], which can be kept directly as digits of a redundant result. But how can we extend this beyond a single addition? Apr. 2012 Computer Arithmetic, Number Representation

47 Addition of Redundant Numbers
Position sum decomposition: [0, 36] = 10 × [0, 2] + [0, 16] Absorption of transfer digit: [0, 16] + [0, 2] = [0, 18] Fig. Adding radix-10 numbers with digit set [0, 18]. Apr. 2012 Computer Arithmetic, Number Representation
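A Python sketch (ours, not the book's) of this scheme; digit lists are assumed LSD first, and each position works independently of all others, so in hardware the loop body runs in parallel:

    def add_stored_carry_decimal(x, y):
        """x, y: equal-length digit lists over [0, 18], LSD first."""
        t_in, s = 0, []
        for xd, yd in zip(x, y):
            p = xd + yd                       # position sum in [0, 36]
            t_out = min(p // 10, 2)           # transfer digit in [0, 2]
            w = p - 10 * t_out                # interim sum in [0, 16]
            s.append(w + t_in)                # absorb incoming transfer: [0, 18]
            t_in = t_out
        return s + [t_in]

    # digits 18,18 in both operands: 198 + 198 = 396 = 16 + 18*10 + 2*100
    assert add_stored_carry_decimal([18, 18], [18, 18]) == [16, 18, 2]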

48 Meaning of Carry-Free Addition
Interim sum at position i Operand digits at position i Transfer digit into position i Fig Ideal and practical carry-free addition schemes. Apr. 2012 Computer Arithmetic, Number Representation

49 Computer Arithmetic, Number Representation
So, redundancy helps us achieve carry-free addition But how much redundancy is actually needed? Is [0, 11] enough for r = 10? Redundancy index ρ = a + b + 1 – r For example, 0 + 11 + 1 – 10 = 2 Fig. Adding radix-10 numbers with digit set [0, 11]. Apr. 2012 Computer Arithmetic, Number Representation

50 3.2 Redundancy in Computer Arithmetic
Binary Inputs One or more arithmetic operations Output Binary-to-redundant converter Redundant-to-binary converter Overhead (often zero) Overhead (always nonzero) The more the amount of computation performed between the initial forward conversion and final reverse conversion (reconversion), the greater the benefits of redundant representation. Same block diagram applies to residue number systems of Chapter 4. Apr. 2012 Computer Arithmetic, Number Representation

51 Binary Carry-Save or Stored-Carry Representation
Oldest example of redundancy in computer arithmetic is the stored-carry representation (carry-save addition) Fig Addition of four binary numbers, with the sum obtained in stored-carry form. Apr. 2012 Computer Arithmetic, Number Representation

52 Hardware for Carry-Save Addition
Two-bit encoding for binary stored-carry digits used in this implementation: 0 represented as 0 0; 1 represented as 0 1 or as 1 0; 2 represented as 1 1 Because in carry-save addition three binary numbers are reduced to two binary numbers, this process is sometimes referred to as 3-2 compression Fig. Using an array of independent binary full adders to perform carry-save addition. Apr. 2012 Computer Arithmetic, Number Representation
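The 3-2 compression performed by the full-adder array can be mimicked with bitwise operations on Python integers (our sketch): the XOR gives the sum bits, the majority gives the carry bits, and no carry ever crosses a bit position.

    def carry_save_add(x, y, z):
        sum_bits   = x ^ y ^ z                        # per-position parity
        carry_bits = (x & y) | (x & z) | (y & z)      # per-position majority
        return sum_bits, carry_bits << 1              # carries have weight 2

    s, c = carry_save_add(11, 6, 7)
    assert s + c == 11 + 6 + 7      # one ordinary add converts back to binary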

53 Carry-Save Addition in Dot Notation
4-to-2 reduction 3-to-2 reduction Fig. 9.3 From text on computer architecture (Parhami, Oxford/2005) We sometimes find it convenient to use an extended dot notation, with heavy dots (●) for posibits and hollow dots (○) for negabits Eight-bit, 2’s-complement number ○ ● ● ● ● ● ● ● Negative-radix number ○ ● ○ ● ○ ● ○ ● BSD number with n, p encoding ○ ○ ○ ○ ○ ○ ○ ○ of the digit set [-1, 1] ● ● ● ● ● ● ● ● Apr. 2012 Computer Arithmetic, Number Representation

54 Example for the Use of Extended Dot Notation
2’s-complement multiplicand ○ ● ● ● ● ● ● ● 2’s-complement multiplier ○ ● ● ● ● ● ● ● ○ ● ● ● ● ● ● ● ○ ● ● ● ● ● ● ● ○ ● ● ● ● ● ● ● ○ ● ● ● ● ● ● ● ○ ● ● ● ● ● ● ● ○ ● ● ● ● ● ● ● ○ ● ● ● ● ● ● ● ● ○ ○ ○ ○ ○ ○ ○ ○ ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Multiplication of 2’s-complement numbers Option 1: sign extension -x -x x x x x Option 2: Baugh-Wooley method -x -y 1- x y Apr. 2012 Computer Arithmetic, Number Representation

55 3.3 Digit Sets and Digit-Set Conversions
Example 3.1: Convert from digit set [0, 18] to [0, 9] in radix 10
= 10 (carry 1) + 8
= 10 (carry 1) + 3
= 10 (carry 1) + 1
= 10 (carry 1) + 8
= 10 (carry 1) + 0
= 10 (carry 1) + 2
Answer: all digits in [0, 9]
Note: Conversion from redundant to nonredundant representation always involves carry propagation; thus, the process is sequential and slow Apr. 2012 Computer Arithmetic, Number Representation

56 Conversion from Carry-Save to Binary
Example 3.2: Convert from digit set [0, 2] to [0, 1] in radix 2
= 2 (carry 1) + 0
= 2 (carry 1) + 0
= 2 (carry 1) + 0
= 2 (carry 1) + 0
Answer: all digits in [0, 1]
Another way: Decompose the carry-save number into two numbers and add them:
1st number: sum bits
2nd number: carry bits
Sum
Apr. 2012 Computer Arithmetic, Number Representation

57 Conversion Between Redundant Digit Sets
Example 3.3: Convert from digit set [0, 18] to [-6, 5] in radix 10 (same as Example 3.1, but with the target digit set signed and redundant)
= 20 (carry 2) – 2
= 10 (carry 1) + 4
= 10 (carry 1) + 1
= 20 (carry 2) – 2
= 10 (carry 1) + 1
= 10 (carry 1) + 2
Answer: all digits in [-6, 5]
On line 2, we could have written 14 = 20 (carry 2) – 6; this would have led to a different, but equivalent, representation In general, several representations may exist for a redundant digit set Apr. 2012 Computer Arithmetic, Number Representation

58 Carry-Free Conversion to a Redundant Digit Set
Example 3.4: Convert from digit set [0, 2] to [-1, 1] in radix 2 (same as Example 3.2, but with the target digit set signed and redundant)
Carry-free conversion:
Carry-save number
Interim digits in [–1, 0]
Transfer digits in [0, 1]
Answer: all digits in [–1, 1]
We rewrite 2 as 2 (carry 1) + 0, and 1 as 2 (carry 1) – 1 A carry of 1 is always absorbed by the interim digit, which is in {–1, 0} Apr. 2012 Computer Arithmetic, Number Representation
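A sketch of Example 3.4 in Python (ours): each digit 1 or 2 is rewritten as 2 × (transfer) + (interim digit in [–1, 0]), so the incoming transfer, always in [0, 1], is absorbed on the spot.

    def carry_save_to_bsd(digits):
        """digits: LSD first, each in [0, 2]; returns BSD digits in [-1, 1]."""
        t_in, out = 0, []
        for d in digits:
            t_out, w = (1, d - 2) if d >= 1 else (0, 0)   # w in [-1, 0]
            out.append(w + t_in)                          # result in [-1, 1]
            t_in = t_out
        return out + [t_in]

    bsd = carry_save_to_bsd([2, 1])                       # value 2 + 2 = 4
    assert sum(d * 2**i for i, d in enumerate(bsd)) == 4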

59 3.4 Generalized Signed-Digit Numbers
Radix r Digit set [–a, b] Requirement: a + b + 1 ≥ r Redundancy index: ρ = a + b + 1 – r Fig. 3.6 A taxonomy of redundant and non-redundant positional number systems. Apr. 2012 Computer Arithmetic, Number Representation

60 Encodings for Signed Digits
xi s, v 2’s-compl n, p n, z, p 1 01 001 –1 11 10 100 00 010 –1 11 10 100 00 010 BSD representation of +6 Sign and value encoding 2-bit 2’s-complement Negative & positive flags 1-out-of-3 encoding Fig Four encodings for the BSD digit set [–1, 1]. Posibit {0, 1} Negabit {–1, 0} Doublebit {0, 2} Negadoublebit {–2, 0} Unibit {–1, 1} (a) Extended dot notation (n, p) encoding 2’s-compl. encoding (b) Encodings for a BSD number Two of the encodings above can be shown in extended dot notation Fig Extended dot notation and its use in visualizing some BSD encodings. Apr. 2012 Computer Arithmetic, Number Representation

61 Hybrid Signed-Digit Numbers
Radix-8 GSD with digit set [-4,7] Fig Example of addition for hybrid signed-digit numbers. The hybrid-redundant representation above in extended dot notation: n, p -encoded ○ ● ● ○ ● ● ○ ● ● Nonredundant binary signed digit ● ● ● binary positions Apr. 2012 Computer Arithmetic, Number Representation

62 Hybrid Redundancy in Extended Dot Notation
Radix-8 digit set [–4, 7] Radix-8 digit set [–4, 4] Fig Two hybrid-redundant representations in extended dot notation. Apr. 2012 Computer Arithmetic, Number Representation

63 3.5 Carry-Free Addition Algorithms
Carry-free addition of GSD numbers: Compute the position sums pi = xi + yi Divide pi into a transfer ti+1 and an interim sum wi = pi – r ti+1 Add incoming transfers to get the sum digits si = wi + ti If the transfer digits ti are in [–l, m], we must have –a + l ≤ pi – r ti+1 ≤ b – m (the smallest interim sum must absorb a transfer of –l; the largest, a transfer of m) These constraints lead to: l ≥ a / (r – 1) and m ≥ b / (r – 1) Apr. 2012 Computer Arithmetic, Number Representation
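The three steps above can be written out generically (our sketch, assuming the l and m passed in satisfy the stated constraints):

    import math

    def gsd_add(x, y, r, a, b, l, m):
        """x, y: digit lists in [-a, b], LSD first; transfers in [-l, m]."""
        k = len(x)
        t, w = [0] * (k + 1), [0] * k
        for i in range(k):                        # positions are independent
            p = x[i] + y[i]                       # position sum
            t[i + 1] = math.ceil((p - (b - m)) / r)   # smallest valid transfer
            w[i] = p - r * t[i + 1]               # interim sum in [-a+l, b-m]
        return [w[i] + t[i] for i in range(k)] + [t[k]]   # s_i = w_i + t_i

    # radix 10, digit set [-6, 5], transfers in [-1, 1]: 265 + 155 = 420
    s = gsd_add([5, -4, 3], [5, 5, 1], r=10, a=6, b=5, l=1, m=1)
    assert sum(d * 10**i for i, d in enumerate(s)) == 420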

64 Is Carry-Free Addition Always Applicable?
No: It requires one of the following two conditions: a. r > 2, ρ ≥ 3 b. r > 2, ρ = 2, a ≠ 1, b ≠ 1 (e.g., not [-1, 10] in radix 10) In other words, carry-free addition is inapplicable for r = 2 Perhaps the most useful cases: ρ = 1 (e.g., carry-save) and ρ = 2 with a = 1 or b = 1 (e.g., carry/borrow-save) BSD fails on at least two criteria! Fortunately, in the latter cases, a limited-carry addition algorithm is always applicable Apr. 2012 Computer Arithmetic, Number Representation

65 Limited-Carry Addition
Example: BSD addition Fig. Some implementations for limited-carry addition. Estimate, or early warning Apr. 2012 Computer Arithmetic, Number Representation

66 Limited-Carry BSD Addition
Fig. Limited-carry addition of radix-2 numbers with digit set [–1, 1] using carry estimates. A position sum of –1 is kept intact when the incoming transfer is in [0, 1], whereas it is rewritten as 1 with a carry of –1 when the incoming transfer is in [–1, 0]. This guarantees that the incoming transfer is absorbed and thus –1 ≤ si ≤ 1. Apr. 2012 Computer Arithmetic, Number Representation

67 3.6 Conversions and Support Functions
Example 3.10: Conversion between BSD and standard binary BSD representation of +6 Positive part – Negative part = Difference = Conversion result The negative and positive parts above are particularly easy to obtain if the BSD number has the n, p encoding Conversion from redundant to nonredundant representation always requires full carry propagation Conversion from nonredundant to redundant is often trivial Apr. 2012 Computer Arithmetic, Number Representation

68 Other Arithmetic Support Functions
Zero test: Zero has a unique code under some conditions Sign test: Needs carry propagation Overflow: May be real or apparent (result may be representable) Overflow and its detection in GSD arithmetic:
xk–1 xk–2 . . . x1 x0 k-digit GSD operands
+ yk–1 yk–2 . . . y1 y0
pk–1 pk–2 . . . p1 p0 Position sums
wk–1 wk–2 . . . w1 w0 Interim sum digits
tk tk–1 . . . t2 t1 Transfer digits
sk–1 sk–2 . . . s1 s0 k-digit apparent sum
Apr. 2012 Computer Arithmetic, Number Representation

69 4 Residue Number Systems
Chapter Goals Study a way of encoding large numbers as a collection of smaller numbers to simplify and speed up some operations Chapter Highlights Moduli, range, arithmetic operations Many sets of moduli possible: tradeoffs Conversions between RNS and binary The Chinese remainder theorem Why are RNS applications limited? Apr. 2012 Computer Arithmetic, Number Representation

70 Residue Number Systems: Topics
Topics in This Chapter 4.1 RNS Representation and Arithmetic 4.2 Choosing the RNS Moduli 4.3 Encoding and Decoding of Numbers 4.4 Difficult RNS Arithmetic Operations 4.5 Redundant RNS Representations 4.6 Limits of Fast Arithmetic in RNS Apr. 2012 Computer Arithmetic, Number Representation

71 4.1 RNS Representations and Arithmetic
Puzzle, due to the Chinese scholar Sun Tzu, 1500+ years ago: What number has the remainders of 2, 3, and 2 when divided by 7, 5, and 3, respectively? Residues (akin to digits in positional systems) uniquely identify the number, hence they constitute a representation Pairwise relatively prime moduli: mk–1 > ... > m1 > m0 The residue xi of x wrt the ith modulus mi (similar to a digit): xi = x mod mi = ⟨x⟩mi RNS representation contains a list of k residues or digits: x = (2 | 3 | 2)RNS(7|5|3) Default RNS for this chapter: RNS(8 | 7 | 5 | 3) Apr. 2012 Computer Arithmetic, Number Representation

72 Computer Arithmetic, Number Representation
RNS Dynamic Range Product M of the k pairwise relatively prime moduli is the dynamic range M = mk–1 × ... × m1 × m0 For RNS(8 | 7 | 5 | 3), M = 8 × 7 × 5 × 3 = 840 Negative numbers: Complement relative to M ⟨–x⟩mi = ⟨M – x⟩mi 21 = (5 | 0 | 1 | 0)RNS –21 = (8 – 5 | 0 | 5 – 1 | 0)RNS = (3 | 0 | 4 | 0)RNS We can take the range of RNS(8|7|5|3) to be [–420, 419] or any other set of 840 consecutive integers Here are some example numbers in our default RNS(8 | 7 | 5 | 3):
(0 | 0 | 0 | 0)RNS represents 0 or 840 or ...
(1 | 1 | 1 | 1)RNS represents 1 or 841 or ...
(2 | 2 | 2 | 2)RNS represents 2 or 842 or ...
(0 | 1 | 3 | 2)RNS represents 8 or 848 or ...
(5 | 0 | 1 | 0)RNS represents 21 or 861 or ...
(0 | 1 | 4 | 1)RNS represents 64 or 904 or ...
(2 | 0 | 0 | 2)RNS represents –70 or 770 or ...
(7 | 6 | 4 | 2)RNS represents –1 or 839 or ...
Apr. 2012 Computer Arithmetic, Number Representation

73 RNS as Weighted Representation
For RNS(8 | 7 | 5 | 3), the weights of the 4 positions are: 105, 120, 336, 280 Example: (1 | 2 | 4 | 0)RNS represents the number ⟨1 × 105 + 2 × 120 + 4 × 336 + 0 × 280⟩840 = ⟨1689⟩840 = 9 For RNS(7 | 5 | 3), the weights of the 3 positions are: 15, 21, 70 Example -- Chinese puzzle: (2 | 3 | 2)RNS(7|5|3) represents the number ⟨2 × 15 + 3 × 21 + 2 × 70⟩105 = ⟨233⟩105 = 23 We will see later how the weights can be determined for a given RNS Apr. 2012 Computer Arithmetic, Number Representation

74 RNS Encoding and Arithmetic Operations
Fig. The structure of an adder, subtractor, or multiplier for RNS(8|7|5|3). Fig. Binary-coded format for RNS(8 | 7 | 5 | 3). Arithmetic in RNS(8 | 7 | 5 | 3): (5 | 5 | 0 | 2)RNS represents x = +5 (7 | 6 | 4 | 2)RNS represents y = –1 x + y = (4 | 4 | 4 | 1)RNS : ⟨5 + 7⟩8 = 4, ⟨5 + 6⟩7 = 4, etc. x – y = (6 | 6 | 1 | 0)RNS : ⟨5 – 7⟩8 = 6, ⟨5 – 6⟩7 = 6, etc. (alternatively, find –y and add to x) x × y = (3 | 2 | 0 | 1)RNS : ⟨5 × 7⟩8 = 3, ⟨5 × 6⟩7 = 2, etc. Apr. 2012 Computer Arithmetic, Number Representation
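Componentwise RNS arithmetic is a one-liner per operation; here is a sketch of ours for the default moduli:

    MODULI = (8, 7, 5, 3)

    def to_rns(x):
        return tuple(x % m for m in MODULI)

    def rns_op(xr, yr, op):          # apply op independently in each position
        return tuple(op(a, b) % m for a, b, m in zip(xr, yr, MODULI))

    x, y = to_rns(5), to_rns(-1)     # (5|5|0|2) and (7|6|4|2)
    assert rns_op(x, y, lambda a, b: a + b) == to_rns(4)    # x + y = 4
    assert rns_op(x, y, lambda a, b: a * b) == to_rns(-5)   # x * y = -5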

75 4.2 Choosing the RNS Moduli
Target range for our RNS: Decimal values [0, 100 000] Strategy 1: To minimize the largest modulus, and thus ensure high-speed arithmetic, pick prime numbers in sequence Pick m0 = 2, m1 = 3, m2 = 5, etc. After adding m5 = 13: RNS(13 | 11 | 7 | 5 | 3 | 2) M = 30 030 Inadequate RNS(17 | 13 | 11 | 7 | 5 | 3 | 2) M = 510 510 Too large RNS(17 | 13 | 11 | 7 | 3 | 2) M = 102 102 Just right! Total width: 5 + 4 + 4 + 3 + 2 + 1 = 19 bits Fine tuning: Combine pairs of moduli 2 & 13 (26) and 3 & 7 (21) RNS(26 | 21 | 17 | 11) M = 102 102 Apr. 2012 Computer Arithmetic, Number Representation

76 Computer Arithmetic, Number Representation
An Improved Strategy Target range for our RNS: Decimal values [0, 100 000] Strategy 2: Improve strategy 1 by including powers of smaller primes before proceeding to the next larger prime RNS(2^2 | 3) M = 12 RNS(3^2 | 2^3 | 7 | 5) M = 2520 RNS(11 | 3^2 | 2^3 | 7 | 5) M = 27 720 RNS(13 | 11 | 3^2 | 2^3 | 7 | 5) M = 360 360 (remove one 3, combine 3 & 5) RNS(15 | 13 | 11 | 2^3 | 7) M = 120 120 Total width: 4 + 4 + 4 + 3 + 3 = 18 bits Fine tuning: Maximize the size of the even modulus within the 4-bit limit RNS(2^4 | 13 | 11 | 3^2 | 7 | 5) M = 720 720 Too large We can now remove 5 or 7; not an improvement in this example Apr. 2012 Computer Arithmetic, Number Representation

77 Computer Arithmetic, Number Representation
Low-Cost RNS Moduli Target range for our RNS: Decimal values [0, 100 000] Strategy 3: To simplify the modular reduction (mod mi) operations, choose only moduli of the forms 2^a or 2^a – 1, aka “low-cost moduli” RNS(2^ak–1 | 2^ak–2 – 1 | ... | 2^a1 – 1 | 2^a0 – 1) We can have only one even modulus 2^ai – 1 and 2^aj – 1 are relatively prime iff ai and aj are relatively prime RNS(2^3 | 2^3–1 | 2^2–1) basis: 3, 2 M = 168 RNS(2^4 | 2^4–1 | 2^3–1) basis: 4, 3 M = 1680 RNS(2^5 | 2^5–1 | 2^3–1 | 2^2–1) basis: 5, 3, 2 M = 20 832 RNS(2^5 | 2^5–1 | 2^4–1 | 2^3–1) basis: 5, 4, 3 M = 104 160 Comparison: RNS(15 | 13 | 11 | 2^3 | 7) 18 bits M = 120 120 RNS(2^5 | 2^5–1 | 2^4–1 | 2^3–1) 17 bits M = 104 160 Apr. 2012 Computer Arithmetic, Number Representation

78 Low- and Moderate-Cost RNS Moduli
Target range for our RNS: Decimal values [0, 100 000] Strategy 4: To simplify the modular reduction (mod mi) operations, choose moduli of the forms 2^a, 2^a – 1, or 2^a + 1 RNS(2^ak–1 | 2^ak–2 ± 1 | ... | 2^a1 ± 1 | 2^a0 ± 1) We can have only one even modulus 2^ai – 1 and 2^aj + 1 are relatively prime RNS(2^5 | 2^4–1 | 2^4+1 | 2^3–1) M = 57 120 RNS(2^5 | 2^4+1 | 2^3+1 | 2^3–1 | 2^2–1) M = 102 816 Neither 5 nor 3 is acceptable The modulus 2^a + 1 is not as convenient as 2^a – 1 (needs an extra bit for the residue, and modular operations are not as simple) Diminished-1 representation of values in [0, 2^a] is a way to simplify things: Represent 0 by a special flag bit and nonzero values by coding one less Apr. 2012 Computer Arithmetic, Number Representation

79 4.3 Encoding and Decoding of Numbers
Binary Inputs One or more arithmetic Operations Output Binary-to-RNS converter RNS-to-binary converter Encoding or forward conversion Decoding or reverse conversion Example: Digital filter The more the amount of computation performed between the initial forward conversion and final reverse conversion (reconversion), the greater the benefits of RNS representation. Apr. 2012 Computer Arithmetic, Number Representation

80 Conversion from Binary/Decimal to RNS
Example 4.1: Represent the number y = (1 0 1 0 0 1 0 0)two = (164)ten in RNS(8 | 7 | 5 | 3) The mod-8 residue is easy to find: x3 = ⟨y⟩8 = (100)two = 4 We have y = 2^7 + 2^5 + 2^2; thus x2 = ⟨y⟩7 = ⟨2 + 4 + 4⟩7 = 3 x1 = ⟨y⟩5 = ⟨3 + 2 + 4⟩5 = 4 x0 = ⟨y⟩3 = ⟨2 + 2 + 1⟩3 = 2
Table 4.1 Residues of the first 10 powers of 2
i | 2^i | ⟨2^i⟩7 | ⟨2^i⟩5 | ⟨2^i⟩3
0 | 1 | 1 | 1 | 1
1 | 2 | 2 | 2 | 2
2 | 4 | 4 | 4 | 1
3 | 8 | 1 | 3 | 2
4 | 16 | 2 | 1 | 1
5 | 32 | 4 | 2 | 2
6 | 64 | 1 | 4 | 1
7 | 128 | 2 | 3 | 2
8 | 256 | 4 | 1 | 1
9 | 512 | 1 | 2 | 2
Apr. 2012 Computer Arithmetic, Number Representation
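A sketch (ours) of the table-based forward conversion: the residue of y mod m is the mod-m sum of the residues of the powers of 2 selected by the 1 bits of y.

    def residue_from_bits(y, m, width=10):
        pow2 = [(2**i) % m for i in range(width)]   # Table 4.1, one column
        r = 0
        for i in range(width):
            if (y >> i) & 1:
                r = (r + pow2[i]) % m               # modular accumulation
        return r

    y = 0b10100100                                  # (164)ten
    assert [residue_from_bits(y, m) for m in (8, 7, 5, 3)] == [4, 3, 4, 2]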

81 Conversion from RNS to Mixed-Radix Form
MRS(mk–1 | ... | m2 | m1 | m0) is a k-digit positional system with position weights mk–2 ... m2 m1 m0, ..., m2 m1 m0, m1 m0, m0, 1 and digit sets [0, mk–1 – 1], ..., [0, m2 – 1], [0, m1 – 1], [0, m0 – 1] Example: (0 | 3 | 1 | 0)MRS(8|7|5|3) = 0 × 105 + 3 × 15 + 1 × 3 + 0 × 1 = 48 RNS-to-MRS conversion problem: y = (xk–1 | ... | x2 | x1 | x0)RNS = (zk–1 | ... | z2 | z1 | z0)MRS MRS representation allows magnitude comparison and sign detection Example: 48 versus 45 (0 | 6 | 3 | 0)RNS vs (5 | 3 | 0 | 0)RNS (000 | 110 | 011 | 00)RNS vs (101 | 011 | 000 | 00)RNS Equivalent mixed-radix representations: (0 | 3 | 1 | 0)MRS vs (0 | 3 | 0 | 0)MRS (000 | 011 | 001 | 00)MRS vs (000 | 011 | 000 | 00)MRS Apr. 2012 Computer Arithmetic, Number Representation

82 Conversion from RNS to Binary/Decimal
Theorem 4.1 (The Chinese remainder theorem) x = (xk–1 | ... | x2 | x1 | x0)RNS = ⟨Σi Mi ⟨ai xi⟩mi⟩M where Mi = M/mi and ai = ⟨Mi^–1⟩mi (multiplicative inverse of Mi wrt mi) Implementing CRT-based RNS-to-binary conversion: x = ⟨Σi Mi ⟨ai xi⟩mi⟩M = ⟨Σi fi(xi)⟩M We can use a table to store the fi values -- Σi mi entries Table: Values needed in applying the Chinese remainder theorem to RNS(8 | 7 | 5 | 3); columns: i, mi, xi, ⟨Mi ⟨ai xi⟩mi⟩M Apr. 2012 Computer Arithmetic, Number Representation

83 Intuitive Justification for CRT
Puzzle: What number has the remainders of 2, 3, and 2 when divided by the numbers 7, 5, and 3, respectively? x = (2 | 3 | 2)RNS(7|5|3) = (?)ten (1 | 0 | 0)RNS(7|5|3) = multiple of 15 that is 1 mod 7 = 15 (0 | 1 | 0)RNS(7|5|3) = multiple of 21 that is 1 mod 5 = 21 (0 | 0 | 1)RNS(7|5|3) = multiple of 35 that is 1 mod 3 = 70 (2 | 3 | 2)RNS(7|5|3) = (2 | 0 | 0) + (0 | 3 | 0) + (0 | 0 | 2) = 2 × (1 | 0 | 0) + 3 × (0 | 1 | 0) + 2 × (0 | 0 | 1) = 2 × 15 + 3 × 21 + 2 × 70 = 30 + 63 + 140 = 233 = 23 mod 105 Therefore, x = (23)ten Apr. 2012 Computer Arithmetic, Number Representation
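The CRT decoding formula of Theorem 4.1 fits in a few lines of Python (our sketch; pow(Mi, -1, mi) computes the multiplicative inverse and needs Python 3.8+):

    MODULI = (8, 7, 5, 3)
    M = 8 * 7 * 5 * 3                               # 840

    def crt_decode(residues):
        x = 0
        for xi, mi in zip(residues, MODULI):
            Mi = M // mi
            ai = pow(Mi, -1, mi)                    # inverse of Mi mod mi
            x = (x + Mi * ((ai * xi) % mi)) % M
        return x

    assert crt_decode((5, 0, 1, 0)) == 21
    assert crt_decode((0, 1, 4, 1)) == 64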

84 4.4 Difficult RNS Arithmetic Operations
Sign test and magnitude comparison are difficult Example: Of the following RNS(8 | 7 | 5 | 3) numbers: Which, if any, are negative? Which is the largest? Which is the smallest? Assume a range of [–420, 419] a = (0 | 1 | 3 | 2)RNS b = (0 | 1 | 4 | 1)RNS c = (0 | 6 | 2 | 1)RNS d = (2 | 0 | 0 | 2)RNS e = (5 | 0 | 1 | 0)RNS f = (7 | 6 | 4 | 2)RNS Answers: d < c < f < a < e < b –70 < –8 < –1 < 8 < 21 < 64 Apr. 2012 Computer Arithmetic, Number Representation

85 Approximate CRT Decoding
Theorem 4.1 (The Chinese remainder theorem, scaled version) Divide both sides of the CRT equality by M to get a scaled version of x in [0, 1): x = (xk–1 | ... | x2 | x1 | x0)RNS = ⟨Σi Mi ⟨ai xi⟩mi⟩M x/M = ⟨Σi ⟨ai xi⟩mi / mi⟩1 = ⟨Σi gi(xi)⟩1 where mod-1 summation implies that we discard the integer parts Errors can be estimated and kept in check for the particular application Table: Values needed in applying approximate Chinese remainder theorem decoding to RNS(8 | 7 | 5 | 3); columns: i, mi, xi, ⟨ai xi⟩mi / mi Apr. 2012 Computer Arithmetic, Number Representation

86 Computer Arithmetic, Number Representation
General RNS Division General RNS division, as opposed to division by one of the moduli (aka scaling), is difficult; hence, use of RNS is unlikely to be effective when an application requires many divisions Scheme proposed in 1994 PhD thesis of Ching-Yu Hung (UCSB): Use an algorithm that has built-in tolerance to imprecision, and apply the approximate CRT decoding to choose quotient digits Example –– SRT algorithm (s is the partial remainder) s < 0 quotient digit = –1 s  0 quotient digit = 0 s > 0 quotient digit = 1 The BSD quotient can be converted to RNS on the fly Apr. 2012 Computer Arithmetic, Number Representation

87 4.5 Redundant RNS Representations
[Range annotations: [0, 15], [0, 12], [0, 15], [0, 11] if cout = 1, [0, 15]] Fig. Adding a 4-bit ordinary mod-13 residue x to a 4-bit pseudoresidue y, producing a 4-bit mod-13 pseudoresidue z. Fig. A modulo-m multiply-add cell that accumulates the sum into a double-length redundant pseudoresidue. Apr. 2012 Computer Arithmetic, Number Representation

88 4.6 Limits of Fast Arithmetic in RNS
Known results from number theory Theorem 4.2: The ith prime pi is asymptotically i ln i Theorem 4.3: The number of primes in [1, n] is asymptotically n / ln n Theorem 4.4: The product of all primes in [1, n] is asymptotically e^n Implications to speed of arithmetic in RNS Theorem 4.5: It is possible to represent all k-bit binary numbers in RNS with O(k / log k) moduli such that the largest modulus has O(log k) bits That is, with fast log-time adders, addition needs O(log log k) time Apr. 2012 Computer Arithmetic, Number Representation

89 Limits for Low-Cost RNS
Known results from number theory Theorem 4.6: The numbers 2^a – 1 and 2^b – 1 are relatively prime iff a and b are relatively prime Theorem 4.7: The sum of the first i primes is asymptotically O(i² ln i) Implications to speed of arithmetic in low-cost RNS Theorem 4.8: It is possible to represent all k-bit binary numbers in RNS with O((k / log k)^1/2) low-cost moduli of the form 2^a – 1 such that the largest modulus has O((k log k)^1/2) bits Because a fast adder needs O(log k) time, asymptotically, low-cost RNS offers little speed advantage over standard binary Apr. 2012 Computer Arithmetic, Number Representation

90 Disclaimer About RNS Representations
RNS representations are sometimes referred to as “carry-free” Positional representation does not support totally carry-free addition, but it appears that RNS does allow digitwise arithmetic However, even though each RNS digit is processed independently (for +, –, ×), the size of the digit set depends on the desired range (it grows at least double-logarithmically with the range M, or logarithmically with the word width k in the binary representation of the same range) Apr. 2012 Computer Arithmetic, Number Representation

91 Part II Addition / Subtraction
[Outline graphic listing the book's parts and chapters, ending with 28. Reconfigurable Arithmetic and the Appendix: Past, Present, and Future] Apr. 2012 Computer Arithmetic, Addition/Subtraction

92 About This Presentation
This presentation is intended to support the use of the textbook Computer Arithmetic: Algorithms and Hardware Designs (Oxford U. Press, 2nd ed., 2010). It is updated regularly by the author as part of his teaching of the graduate course ECE 252B, Computer Arithmetic, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Unauthorized uses are strictly prohibited. © Behrooz Parhami
Edition   Released    Revised
First     Jan. 2000   Sep. 2001, Sep. 2003, Oct. 2005, Apr. 2007, Apr. 2008, Apr. 2009
Second    Apr. 2010   Mar. 2011, Apr. 2012
Apr. 2012 Computer Arithmetic, Addition/Subtraction

93 II Addition / Subtraction
Review addition schemes and various speedup methods Addition is a key op (in itself, and as a building block) Subtraction = negation + addition Carry propagation speedup: lookahead, skip, select, … Two-operand versus multioperand addition Topics in This Part Chapter 5 Basic Addition and Counting Chapter 6 Carry-Lookahead Adders Chapter 7 Variations in Fast Adders Chapter 8 Multioperand Addition Apr. 2012 Computer Arithmetic, Addition/Subtraction

94 Computer Arithmetic, Addition/Subtraction
Apr. 2012 Computer Arithmetic, Addition/Subtraction

95 5 Basic Addition and Counting
Chapter Goals Study the design of ripple-carry adders, discuss why their latency is unacceptable, and set the foundation for faster adders Chapter Highlights Full adders are versatile building blocks Longest carry chain on average: log2k bits Fast asynchronous adders are simple Counting is relatively easy to speed up Key part of a fast adder is its carry network Apr. 2012 Computer Arithmetic, Addition/Subtraction

96 Basic Addition and Counting: Topics
Topics in This Chapter 5.1 Bit-Serial and Ripple-Carry Adders 5.2 Conditions and Exceptions 5.3 Analysis of Carry Propagation 5.4 Carry Completion Detection 5.5 Addition of a Constant 5.6 Manchester Carry Chains and Adders Apr. 2012 Computer Arithmetic, Addition/Subtraction

97 5.1 Bit-Serial and Ripple-Carry Adders
Half-adder (HA): Truth table and block diagram Full-adder (FA): Truth table and block diagram Apr. 2012 Computer Arithmetic, Addition/Subtraction

98 Half-Adder Implementations
Fig. Three implementations of a half-adder. Apr. 2012 Computer Arithmetic, Addition/Subtraction

99 Full-Adder Implementations
Fig Possible designs for a full-adder in terms of half-adders, logic gates, and CMOS transmission gates. Apr. 2012 Computer Arithmetic, Addition/Subtraction

100 Full-Adder Implementations
Fig (alternate version) Possible designs for a full-adder in terms of half-adders, logic gates, and CMOS transmission gates. Apr. 2012 Computer Arithmetic, Addition/Subtraction

101 Some Full-Adder Details
Logic equations for a full-adder: s = x ⊕ y ⊕ cin (odd parity function) = x' y' cin ∨ x' y cin' ∨ x y' cin' ∨ x y cin cout = x y ∨ x cin ∨ y cin (majority function) CMOS transmission gate and its use in a 2-to-1 mux. Apr. 2012 Computer Arithmetic, Addition/Subtraction
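The two equations, checked exhaustively against integer addition (our sketch):

    def full_adder(x, y, cin):
        s    = x ^ y ^ cin                       # odd parity of the inputs
        cout = (x & y) | (x & cin) | (y & cin)   # majority of the inputs
        return s, cout

    for x in (0, 1):
        for y in (0, 1):
            for cin in (0, 1):
                s, cout = full_adder(x, y, cin)
                assert 2 * cout + s == x + y + cin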

102 Simple Adders Built of Full-Adders
Fig Using full-adders in building bit-serial and ripple-carry adders. Apr. 2012 Computer Arithmetic, Addition/Subtraction

103 VLSI Layout of a Ripple-Carry Adder
Fig The layout of a 4-bit ripple-carry adder in CMOS implementation [Puck94]. Apr. 2012 Computer Arithmetic, Addition/Subtraction

104 Critical Path Through a Ripple-Carry Adder
T(ripple-add) = T_FA(x, y → cout) + (k – 2) × T_FA(cin → cout) + T_FA(cin → s) Fig. Critical path in a k-bit ripple-carry adder. Apr. 2012 Computer Arithmetic, Addition/Subtraction

105 Binary Adders as Versatile Building Blocks
Set one input to 0: cout = AND of other inputs Set one input to 1: cout = OR of other inputs Set one input to 0 and another to 1: s = NOT of third input Fig Four-bit binary adder used to realize the logic function f = w + xyz and its complement. Apr. 2012 Computer Arithmetic, Addition/Subtraction

106 5.2 Conditions and Exceptions
Fig. Two’s-complement adder with provisions for detecting conditions and exceptions. overflow(2’s-compl) = x'k–1 y'k–1 sk–1 ∨ xk–1 yk–1 s'k–1 overflow(2’s-compl) = ck ⊕ ck–1 = ck c'k–1 ∨ c'k ck–1 Apr. 2012 Computer Arithmetic, Addition/Subtraction

107 Computer Arithmetic, Addition/Subtraction
Saturating Adders Saturating (saturation) arithmetic: When a result’s magnitude is too large, do not wrap around; rather, provide the most positive or the most negative value that is representable in the number format Example: In 8-bit 2’s-complement format, an addition whose true sum exceeds 127 wraps around to a negative value, whereas with saturation it yields +127 Saturating arithmetic is desirable in many DSP applications Saturation value Overflow Adder Designing saturating adders: Unsigned (quite easy) Signed (only slightly harder) Apr. 2012 Computer Arithmetic, Addition/Subtraction

108 5.3 Analysis of Carry Propagation
Bit positions cout cin \__________/\__________________/ \________/\____/ Carry chains and their lengths Fig Example addition and its carry propagation chains. Apr. 2012 Computer Arithmetic, Addition/Subtraction

109 Using Probability to Analyze Carry Propagation
Given binary numbers with random bits, for each position i we have: Probability of carry generation = 1/4 (both 1s) Probability of carry annihilation = 1/4 (both 0s) Probability of carry propagation = 1/2 (different) Probability that a carry generated at position i propagates through position j – 1 and stops at position j (j > i): 2^–(j–1–i) × 1/2 = 2^–(j–i) Expected length of the carry chain that starts at position i: 2 – 2^–(k–i–1) Average length of the longest carry chain in k-bit addition is strictly less than log2 k; it is log2(1.25k) per experimental results Analogy: The expected number when rolling one die is 3.5; if one rolls many dice, the expected value of the largest number shown grows Apr. 2012 Computer Arithmetic, Addition/Subtraction
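A quick Monte Carlo check of the log2(1.25k) claim (our sketch, not from the book; it measures the longest run of consecutive 1 carries, a reasonable proxy for the longest chain):

    import math, random

    def longest_carry_run(x, y, k):
        carry = longest = run = 0
        for i in range(k):
            xi, yi = (x >> i) & 1, (y >> i) & 1
            carry = (xi & yi) | ((xi | yi) & carry)   # g or (t and c_in)
            run = run + 1 if carry else 0
            longest = max(longest, run)
        return longest

    k, trials = 64, 10_000
    avg = sum(longest_carry_run(random.getrandbits(k), random.getrandbits(k), k)
              for _ in range(trials)) / trials
    print(avg, math.log2(1.25 * k))       # the two values should be close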

110 5.4 Carry Completion Detection
bi ci 0 0 Carry not yet known 0 1 Carry known to be 1 1 0 Carry known to be 0 Fig The carry network of an adder with two-rail carries and carry completion detection logic. Apr. 2012 Computer Arithmetic, Addition/Subtraction

111 5.5 Addition of a Constant: Counters
Fig An up (down) counter built of a register, an incrementer (decrementer), and a multiplexer. Apr. 2012 Computer Arithmetic, Addition/Subtraction

112 Implementing a Simple Up Counter
(From the author’s computer architecture text) Ripple-carry incrementer for use in an up counter. Fig. Four-bit asynchronous up counter built only of negative-edge-triggered T flip-flops. Apr. 2012 Computer Arithmetic, Addition/Subtraction

113 Faster and Constant-Time Counters
Any fast adder design can be specialized and optimized to yield a fast counter (carry-lookahead, carry-skip, etc.) One can use redundant representation to build a constant-time counter, but a conversion penalty must be paid during read-out Fig Fast (constant-time) three-stage up counter. Apr. 2012 Computer Arithmetic, Addition/Subtraction

114 5.6 Manchester Carry Chains and Adders
Sum digit in radix r: si = (xi + yi + ci) mod r Special case of radix 2: si = xi ⊕ yi ⊕ ci Computing the carries ci is thus our central problem For this, the actual operand digits are not important What matters is whether in a given position a carry is generated, propagated, or annihilated (absorbed) For binary addition: gi = xi yi pi = xi ⊕ yi ai = x'i y'i = (xi ∨ yi)' It is also helpful to define a transfer signal: ti = gi ∨ pi = a'i = xi ∨ yi Using these signals, the carry recurrence is written as ci+1 = gi ∨ ci pi = gi ∨ ci gi ∨ ci pi = gi ∨ ci ti Apr. 2012 Computer Arithmetic, Addition/Subtraction
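The recurrence ci+1 = gi ∨ ci ti translates directly into a loop over bit positions (our sketch):

    def carries(x, y, k, cin=0):
        c = [cin]
        for i in range(k):
            gi = (x >> i) & (y >> i) & 1          # generate
            ti = ((x >> i) | (y >> i)) & 1        # transfer
            c.append(gi | (c[i] & ti))
        return c                                  # c[0..k]; c[k] is carry-out

    x, y, k = 0b1011, 0b0110, 4                   # 11 + 6 = 17
    c = carries(x, y, k)
    s = sum((((x >> i) ^ (y >> i) ^ c[i]) & 1) << i for i in range(k))
    assert s + (c[k] << k) == x + y               # sum bits: s_i = x_i ^ y_i ^ c_i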

115 Manchester Carry Network
The worst-case delay of a Manchester carry chain has three components: 1. Latency of forming the switch control signals 2. Set-up time for switches 3. Signal propagation delay through k switches Fig One stage in a Manchester carry chain. Apr. 2012 Computer Arithmetic, Addition/Subtraction

116 Details of a 5-Bit Manchester Carry Network
Dynamic logic, with 2-phase operation Clock low: Precharge (ci = 0) Clock high: Pull-down (if gi = 1) The transistors must be sized appropriately for maximum speed Smaller transistors Larger transistors i = 4 i = 3 i = 2 i = 1 i = 0 c0 c5 c0 c1 c2 c3 c4 Carry chain of a 5-bit Manchester adder. Apr. 2012 Computer Arithmetic, Addition/Subtraction

117 Carry Network is the Essence of a Fast Adder
gi = xi yi pi = xi ⊕ yi Ripple; Skip; Lookahead; Parallel-prefix Fig. Generic structure of a binary adder, highlighting its carry network. Apr. 2012 Computer Arithmetic, Addition/Subtraction

118 Ripple-Carry Adder Revisited
The carry recurrence: ci+1 = gi ∨ pi ci Latency of a k-bit adder is roughly 2k gate delays: 1 gate delay for production of p and g signals, plus 2(k – 1) gate delays for carry propagation, plus 1 XOR gate delay for generation of the sum bits Fig. Alternate view of a ripple-carry network in connection with the generic adder structure shown above. Apr. 2012 Computer Arithmetic, Addition/Subtraction

119 The Complete Design of a Ripple-Carry Adder
gi = xi yi pi = xi ⊕ yi Fig. The ripple-carry network superimposed on the generic adder structure. Apr. 2012 Computer Arithmetic, Addition/Subtraction

120 6 Carry-Lookahead Adders
Chapter Goals Understand the carry-lookahead method and its many variations used in the design of fast adders Chapter Highlights Single- and multilevel carry lookahead Various designs for log-time adders Relating the carry determination problem to parallel prefix computation Implementing fast adders in VLSI Apr. 2012 Computer Arithmetic, Addition/Subtraction

121 Carry-Lookahead Adders: Topics
Topics in This Chapter 6.1 Unrolling the Carry Recurrence 6.2 Carry-Lookahead Adder Design 6.3 Ling Adder and Related Designs 6.4 Carry Determination as Prefix Computation 6.5 Alternative Parallel Prefix Networks 6.6 VLSI Implementation Aspects Apr. 2012 Computer Arithmetic, Addition/Subtraction

122 6.1 Unrolling the Carry Recurrence
Recall the generate, propagate, annihilate (absorb), and transfer signals:
Signal | Radix r | Binary
gi is 1 iff | xi + yi ≥ r | xi yi
pi is 1 iff | xi + yi = r – 1 | xi ⊕ yi
ai is 1 iff | xi + yi < r – 1 | x'i y'i = (xi ∨ yi)'
ti is 1 iff | xi + yi ≥ r – 1 | xi ∨ yi
si | (xi + yi + ci) mod r | xi ⊕ yi ⊕ ci
The carry recurrence can be unrolled to obtain each carry signal directly from the inputs, rather than through propagation: ci = gi–1 ∨ ci–1 pi–1 = gi–1 ∨ (gi–2 ∨ ci–2 pi–2) pi–1 = gi–1 ∨ gi–2 pi–1 ∨ ci–2 pi–2 pi–1 = gi–1 ∨ gi–2 pi–1 ∨ gi–3 pi–2 pi–1 ∨ ci–3 pi–3 pi–2 pi–1 = gi–1 ∨ gi–2 pi–1 ∨ gi–3 pi–2 pi–1 ∨ gi–4 pi–3 pi–2 pi–1 ∨ ci–4 pi–4 pi–3 pi–2 pi–1 = . . . Note: Addition symbol vs logical OR Apr. 2012 Computer Arithmetic, Addition/Subtraction

123 Computer Arithmetic, Addition/Subtraction
Full Carry Lookahead s0 s1 s2 s3 y0 y1 y2 y3 x0 x1 x2 x3 cin . . . Theoretically, it is possible to derive each sum digit directly from the inputs that affect it Carry-lookahead adder design is simply a way of reducing the complexity of this ideal, but impractical, arrangement by hardware sharing among the various lookahead circuits Apr. 2012 Computer Arithmetic, Addition/Subtraction

124 Four-Bit Carry-Lookahead Adder
Fig. 6.1 Four-bit carry network with full lookahead. Complexity reduced by deriving the carry-out indirectly Full carry lookahead is quite practical for a 4-bit adder: c1 = g0 ∨ c0 p0 c2 = g1 ∨ g0 p1 ∨ c0 p0 p1 c3 = g2 ∨ g1 p2 ∨ g0 p1 p2 ∨ c0 p0 p1 p2 c4 = g3 ∨ g2 p3 ∨ g1 p2 p3 ∨ g0 p1 p2 p3 ∨ c0 p0 p1 p2 p3 Apr. 2012 Computer Arithmetic, Addition/Subtraction

125 Carry Lookahead Beyond 4 Bits
Consider a 32-bit adder: c1 = g0 ∨ c0 p0 c2 = g1 ∨ g0 p1 ∨ c0 p0 p1 c3 = g2 ∨ g1 p2 ∨ g0 p1 p2 ∨ c0 p0 p1 p2 . . . c31 = g30 ∨ g29 p30 ∨ g28 p29 p30 ∨ g27 p28 p29 p30 ∨ ... ∨ c0 p0 p1 p2 p3 ... p29 p30 No circuit sharing: Repeated computations 32-input AND, 32-input OR . . . High fan-ins necessitate tree-structured circuits Apr. 2012 Computer Arithmetic, Addition/Subtraction

126 Two Solutions to the Fan-in Problem
High-radix addition (i.e., radix 2h) Increases the latency for generating g and p signals and sum digits, but simplifies the carry network (optimal radix?) Multilevel lookahead Example: 16-bit addition Radix-16 (four digits) Two-level carry lookahead (four 4-bit blocks) Either way, the carries c4, c8, and c12 are determined first c16 c15 c14 c13 c12 c11 c10 c9 c8 c7 c6 c5 c4 c3 c2 c1 c0 cout ? ? ? cin Apr. 2012 Computer Arithmetic, Addition/Subtraction

127 6.2 Carry-Lookahead Adder Design
Block generate and propagate signals: g[i,i+3] = gi+3 ∨ gi+2 pi+3 ∨ gi+1 pi+2 pi+3 ∨ gi pi+1 pi+2 pi+3 p[i,i+3] = pi pi+1 pi+2 pi+3 Fig. 6.2b Schematic diagram of a 4-bit lookahead carry generator. Apr. 2012 Computer Arithmetic, Addition/Subtraction

128 A Building Block for Carry-Lookahead Addition
Fig. 6.2a A 4-bit lookahead carry generator Fig. 6.1 A 4-bit carry network Apr. 2012 Computer Arithmetic, Addition/Subtraction

129 Combining Block g and p Signals
Block generate and propagate signals can be combined in the same way as bit g and p signals to form g and p signals for wider blocks Fig Combining of g and p signals of four (contiguous or overlapping) blocks of arbitrary widths into the g and p signals for the overall block [i0, j3]. Apr. 2012 Computer Arithmetic, Addition/Subtraction

130 A Two-Level Carry-Lookahead Adder
Fig. Building a 64-bit carry-lookahead adder from 16 4-bit adders and 5 lookahead carry generators. Carry-out: cout = g[0,k–1] ∨ c0 p[0,k–1] = xk–1 yk–1 ∨ s'k–1 (xk–1 ∨ yk–1) Apr. 2012 Computer Arithmetic, Addition/Subtraction

131 Latency of a Multilevel Carry-Lookahead Adder
Latency through the 16-bit CLA adder consists of finding: g and p for individual bit positions: 1 gate level g and p signals for 4-bit blocks: 2 gate levels Block carry-in signals c4, c8, and c12: 2 gate levels Internal carries within 4-bit blocks: 2 gate levels Sum bits: 2 gate levels Total latency for the 16-bit adder: 9 gate levels (compare to 32 gate levels for a 16-bit ripple-carry adder) Each additional lookahead level adds 4 gate levels of latency Latency for k-bit CLA adder: T(lookahead-add) = 4 log4 k + 1 gate levels Apr. 2012 Computer Arithmetic, Addition/Subtraction

132 6.3 Ling Adder and Related Designs
Consider the carry recurrence and its unrolling by 4 steps: ci = gi–1 ∨ ci–1 ti–1 = gi–1 ∨ gi–2 ti–1 ∨ gi–3 ti–2 ti–1 ∨ gi–4 ti–3 ti–2 ti–1 ∨ ci–4 ti–4 ti–3 ti–2 ti–1 Ling’s modification: Propagate hi = ci ∨ ci–1 instead of ci hi = gi–1 ∨ hi–1 ti–2 = gi–1 ∨ gi–2 ∨ gi–3 ti–2 ∨ gi–4 ti–3 ti–2 ∨ hi–4 ti–4 ti–3 ti–2 CLA: 5 gates, max 5 inputs, 19 gate inputs Ling: 4 gates, max 5 inputs, 14 gate inputs The advantage of hi over ci is even greater with wired-OR: CLA: 4 gates, max 5 inputs, 14 gate inputs Ling: 3 gates, max 4 inputs, 9 gate inputs Once hi is known, however, the sum is obtained by a slightly more complex expression compared with si = pi ⊕ ci: si = pi ⊕ hi ti–1 Propagate harry, not carry! Apr. 2012 Computer Arithmetic, Addition/Subtraction

133 6.4 Carry Determination as Prefix Computation
Fig Combining of g and p signals of two (contiguous or overlapping) blocks B' and B" of arbitrary widths into the g and p signals for block B. Apr. 2012 Computer Arithmetic, Addition/Subtraction

134 Formulating the Prefix Computation Problem
The problem of carry determination can be formulated as: Given (g0, p0), (g1, p1), ..., (gk–2, pk–2), (gk–1, pk–1) Find (g[0,0], p[0,0]), (g[0,1], p[0,1]), ..., (g[0,k–2], p[0,k–2]), (g[0,k–1], p[0,k–1]) which yield c1, c2, ..., ck–1, ck Carry-in can be viewed as an extra (–1) position: (g–1, p–1) = (cin, 0) The desired pairs are found by evaluating all prefixes of (g0, p0) ¢ (g1, p1) ¢ ... ¢ (gk–2, pk–2) ¢ (gk–1, pk–1) The carry operator ¢ is associative, but not commutative: [(g1, p1) ¢ (g2, p2)] ¢ (g3, p3) = (g1, p1) ¢ [(g2, p2) ¢ (g3, p3)] Prefix sums analogy: Given x0, x1, x2, ..., xk–1 Find x0, x0+x1, x0+x1+x2, ..., x0+x1+...+xk–1 Apr. 2012 Computer Arithmetic, Addition/Subtraction
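The carry operator and a naive (sequential) prefix scan over the (g, p) pairs look like this in Python (our sketch; any parallel prefix network computes the same outputs faster):

    def carry_op(left, right):                    # (g', p') ¢ (g'', p'')
        g1, p1 = left
        g2, p2 = right
        return (g2 | (g1 & p2), p1 & p2)          # note: not commutative

    def prefix_carries(gp, cin=0):
        acc, out = (cin, 0), []                   # carry-in as position -1
        for pair in gp:
            acc = carry_op(acc, pair)
            out.append(acc[0])                    # g[0,i] is c_{i+1}
        return out

    x, y = 0b1011, 0b0110                         # 11 + 6
    gp = [((x >> i) & (y >> i) & 1, ((x ^ y) >> i) & 1) for i in range(4)]
    assert prefix_carries(gp) == [0, 1, 1, 1]     # c1..c4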

135 Example Prefix-Based Carry Network
Fig Four-input parallel prefix sums network and its corresponding carry network. (a) A 4-input prefix sums network (b) A 4-bit carry-lookahead network: inputs (g0, p0) through (g3, p3); outputs (g[0,i], p[0,i]) = (ci+1, --) for i = 0, 1, 2, 3 Apr. 2012 Computer Arithmetic, Addition/Subtraction

136 6.5 Alternative Parallel Prefix Networks
Fig Ladner-Fischer parallel prefix sums network built of two k/2-input networks and k/2 adders. Delay recurrence: D(k) = D(k/2) + 1 = log2 k Cost recurrence: C(k) = 2C(k/2) + k/2 = (k/2) log2 k Apr. 2012 Computer Arithmetic, Addition/Subtraction

137 The Brent-Kung Recursive Construction
Fig Parallel prefix sums network built of one k/2-input network and k – 1 adders. Delay recurrence: D(k) = D(k/2) + 2 = 2 log2 k – 1 (actually 2 log2 k – 2) Cost recurrence: C(k) = C(k/2) + k – 1 = 2k – 2 – log2 k Apr. 2012 Computer Arithmetic, Addition/Subtraction

138 Brent-Kung Carry Network (8-Bit Adder)
Apr. 2012 Computer Arithmetic, Addition/Subtraction

139 Brent-Kung Carry Network (16-Bit Adder)
Reason for latency being 2 log2k – 2 Fig Brent-Kung parallel prefix graph for 16 inputs. Apr. 2012 Computer Arithmetic, Addition/Subtraction

140 Kogge-Stone Carry Network (16-Bit Adder)
Cost formula C(k) = (k – 1) + (k – 2) + (k – 4) + ... + (k – k/2) = k log2 k – k + 1 log2 k levels (minimum possible) Fig Kogge-Stone parallel prefix graph for 16 inputs. Apr. 2012 Computer Arithmetic, Addition/Subtraction
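For concreteness, a compact Kogge-Stone scan in Python (a sketch under my own naming; carry_op is repeated so the block is self-contained):

```python
def carry_op(left, right):
    (g1, p1), (g2, p2) = left, right
    return g2 | (g1 & p2), p1 & p2

def kogge_stone(gp):
    """After ceil(log2 k) levels, position i holds (g[0,i], p[0,i])."""
    k, span, cur = len(gp), 1, list(gp)
    while span < k:
        # Position i combines with position i - span; low positions pass through.
        cur = [cur[i] if i < span else carry_op(cur[i - span], cur[i])
               for i in range(k)]
        span *= 2
    return cur

gp = [(1, 0), (0, 1), (0, 1), (0, 0)]       # the 3 + 5 example again
print([g for g, _ in kogge_stone(gp)])      # [1, 1, 1, 0]
```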

141 Speed-Cost Tradeoffs in Carry Networks
Method Delay Cost
Ladner-Fischer log2 k (k/2) log2 k
Kogge-Stone log2 k k log2 k – k + 1
Brent-Kung 2 log2 k – 2 2k – 2 – log2 k
Improving the Ladner/Fischer design: These outputs can be produced one time unit later without increasing the overall latency This strategy saves enough to make the overall cost linear (best possible) Apr. 2012 Computer Arithmetic, Addition/Subtraction

142 Hybrid B-K/K-S Carry Network (16-Bit Adder)
Brent-Kung: 6 levels 26 cells Kogge-Stone: 4 levels 49 cells Fig. 6.11 A Hybrid Brent-Kung/ Kogge-Stone parallel prefix graph for 16 inputs. Hybrid: 5 levels 32 cells Apr. 2012 Computer Arithmetic, Addition/Subtraction

143 6.6 VLSI Implementation Aspects
Example: Radix-256 addition of 56-bit numbers as implemented in the AMD Am29050 CMOS microprocessor Our description is based on the 64-bit version of the adder In radix-256, 64-bit addition, only these carries are needed: c56 c48 c40 c32 c24 c16 c8 First, 4-bit Manchester carry chains (MCCs) of Fig. 6.12a are used to derive g and p signals for 4-bit blocks Next, the g and p signals for 4-bit blocks are combined to form the desired carries, using the MCCs in Fig. 6.12b Apr. 2012 Computer Arithmetic, Addition/Subtraction

144 Four-Bit Manchester Carry Chains
Fig Example 4-bit Manchester carry chain designs in CMOS technology [Lync92]. Apr. 2012 Computer Arithmetic, Addition/Subtraction

145 Carry Network for 64-Bit Adder
Fig Spanning-tree carry-lookahead network [Lync92]. Type-a and Type-b MCCs refer to the circuits of Figs. 6.12a and 6.12b, respectively. Apr. 2012 Computer Arithmetic, Addition/Subtraction

146 7 Variations in Fast Adders
Chapter Goals Study alternatives to the carry-lookahead method for designing fast adders Chapter Highlights Many methods besides CLA are available (both competing and complementary) Best design is technology-dependent (often hybrid rather than pure) Knowledge of timing allows optimizations Apr. 2012 Computer Arithmetic, Addition/Subtraction

147 Variations in Fast Adders: Topics
Topics in This Chapter 7.1 Simple Carry-Skip Adders 7.2 Multilevel Carry-Skip Adders 7.3 Carry-Select Adders 7.4 Conditional-Sum Adder 7.5 Hybrid Designs and Optimizations 7.6 Modular Two-Operand Adders Apr. 2012 Computer Arithmetic, Addition/Subtraction

148 7.1 Simple Carry-Skip Adders
Fig Converting a 16-bit ripple-carry adder into a simple carry-skip adder with 4-bit skip blocks: (a) ripple-carry adder; (b) simple carry-skip adder, where block propagate signals p[0,3], p[4,7], p[8,11], p[12,15] let the carries c0, c4, c8, c12 skip the ripple-carry stages on the way to c16 Apr. 2012 Computer Arithmetic, Addition/Subtraction

149 Another View of Carry-Skip Addition
Street/freeway analogy for carry-skip adder. Apr. 2012 Computer Arithmetic, Addition/Subtraction

150 Skip Carry Logic with OR Gate vs. Mux
Skip logic for block [4j, 4j+3]: the block propagate signal p[4j, 4j+3] gates the incoming carry onto c4j+4 (figure adapted from the author's computer architecture textbook) The carry-skip adder with “OR combining” works fine if we begin with a clean slate, where all signals are 0s at the outset; otherwise, it runs into problems, which do not exist in the mux-based version Apr. 2012 Computer Arithmetic, Addition/Subtraction

151 Carry-Skip Adder with Fixed Block Size
Block width b; k/b blocks to form a k-bit adder (assume b divides k) Tfixed-skip-add = (b – 1) + (k/b – 1) + (b – 1) (in first block + skips + in last block) ≈ 2b + k/b – 3 stages dT/db = 2 – k/b² = 0 ⇒ b opt = √(k/2) T opt = 2√(2k) – 3 Example: k = 32, b opt = 4, T opt = 13 stages (contrast with 32 stages for a ripple-carry adder) Apr. 2012 Computer Arithmetic, Addition/Subtraction
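The optimization is easy to verify numerically; a small sketch (assumptions: unit-time ripple and skip stages, b dividing k):

```python
import math

def t_fixed_skip(k, b):
    return 2 * b + k // b - 3            # (b-1) + (k/b - 1) + (b-1)

k = 32
best = min((b for b in range(1, k + 1) if k % b == 0),
           key=lambda b: t_fixed_skip(k, b))
print(best, t_fixed_skip(k, best))                  # 4 13
print(math.sqrt(k / 2), 2 * math.sqrt(2 * k) - 3)   # 4.0 13.0 -- b_opt, T_opt
```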

152 Carry-Skip Adder with Variable-Width Blocks
Fig Carry-skip adder with variable-size blocks and three sample carry paths. The total number of bits in the t blocks is k: 2[b + (b + 1) + ... + (b + t/2 – 1)] = t(b + t/4 – 1/2) = k ⇒ b = k/t – t/4 + 1/2 Tvar-skip-add = 2(b – 1) + t – 1 = 2k/t + t/2 – 2 dT/dt = –2k/t² + 1/2 = 0 ⇒ t opt = 2√k T opt = 2√k – 2 (a factor of about √2 smaller than for fixed blocks) Apr. 2012 Computer Arithmetic, Addition/Subtraction

153 7.2 Multilevel Carry-Skip Adders
Fig Schematic diagram of a one-level carry-skip adder. Fig Example of a two-level carry-skip adder. Fig Two-level carry-skip adder optimized by removing the short-block skip circuits. Apr. 2012 Computer Arithmetic, Addition/Subtraction

154 Designing a Single-Level Carry-Skip Adder
Example 7.1 Each of the following takes one unit of time: generation of gi and pi, generation of level-i skip signal from level-(i–1) skip signals, ripple, skip, and formation of sum bit once the incoming carry is known Build the widest possible one-level carry-skip adder with total delay of 8 Fig Timing constraints of a single-level carry-skip adder with a delay of 8 units. Max adder width = 18 Generalizing Example 7.1 to a total time T (even or odd), with block widths growing to about T/2 and then shrinking back: for any T, the total width is (T + 1)²/4 – 2 Apr. 2012 Computer Arithmetic, Addition/Subtraction

155 Designing a Two-Level Carry-Skip Adder
Example 7.2 Each of the following takes one unit of time: generation of gi and pi, generation of level-i skip signal from level-(i–1) skip signals, ripple, skip, and formation of sum bit once the incoming carry is known Build the widest possible two-level carry-skip adder with total delay of 8 (a) Initial timing constraints Max adder width = 30 ( ) (b) Final design Fig Two-level carry-skip adder with a delay of 8 units. Apr. 2012 Computer Arithmetic, Addition/Subtraction

156 Elaboration on Two-Level Carry-Skip Adder
Example 7.2 Given the delay pair {b, a} for a level-2 block in Fig. 7.7a, the number of level-1 blocks that can be accommodated is g = min(b – 1, a) Single-level carry-skip adder with Tassimilate = a Single-level carry-skip adder with Tproduce = b Width of the ith level-1 block in the level-2 block characterized by {b, a} is bi = min(b – g + i + 1, a – i); the total block width is then Σ i=0 to g–1 bi Apr. 2012 Computer Arithmetic, Addition/Subtraction

157 Carry-Skip Adder Optimization Scheme
Fig Generalized delay model for carry-skip adders. Apr. 2012 Computer Arithmetic, Addition/Subtraction

158 Computer Arithmetic, Addition/Subtraction
7.3 Carry-Select Adders Fig Carry-select adder for k-bit numbers built from three k/2-bit adders: one adder handles bits 0 to k/2 – 1, and two handle bits k/2 to k – 1 (with carry-ins 0 and 1), a mux picking the correct high half. Cselect-add(k) = 3Cadd(k/2) + k/2 + 1 Tselect-add(k) = Tadd(k/2) + 1 Apr. 2012 Computer Arithmetic, Addition/Subtraction

159 Multilevel Carry-Select Adders
Fig Two-level carry-select adder built of k/4-bit adders. Apr. 2012 Computer Arithmetic, Addition/Subtraction

160 7.4 Conditional-Sum Adder
Multilevel carry-select idea carried out to the extreme (to 1-bit blocks). C(k) ≈ 2C(k/2) + k + 2 ≈ k (log2 k + 2) + k C(1) T(k) = T(k/2) + 1 = log2 k + T(1) where C(1) and T(1) are the cost and delay of the circuit of Fig for deriving the sum and carry bits with a carry-in of 0 and 1 k + 2 is an upper bound on the number of single-bit 2-to-1 multiplexers needed for combining two k/2-bit adders into a k-bit adder Fig Top-level block for one bit position of a conditional-sum adder. Apr. 2012 Computer Arithmetic, Addition/Subtraction

161 Conditional-Sum Addition Example
Table 7.2 Conditional-sum addition of two 16-bit numbers. The width of the block for which the sum and carry bits are known doubles with each additional level, leading to an addition time that grows as the logarithm of the word width k. Apr. 2012 Computer Arithmetic, Addition/Subtraction

162 Elaboration on Conditional-Sum Addition
Two adjacent 4-bit blocks, forming an 8-bit block: the right block covers positions 8j to 8j + 3 and the left block positions 8j + 4 to 8j + 7 Two versions of the sum bits and carry-out of each 4-bit block are combined into two versions of the sum bits and carry-out of the 8-bit block Apr. 2012 Computer Arithmetic, Addition/Subtraction

163 7.5 Hybrid Designs and Optimizations
The most popular hybrid addition scheme: Fig A hybrid carry-lookahead/carry-select adder. Apr. 2012 Computer Arithmetic, Addition/Subtraction

164 Details of a 64-Bit Hybrid CLA/Select Adder
Fig [Lync92]. Each of the carries c8j produced by the tree network above is used to select one of the two versions of the sum in positions 8j to 8j + 7 Apr. 2012 Computer Arithmetic, Addition/Subtraction

165 Any Two Addition Schemes Can Be Combined
Fig Example 48-bit adder with hybrid ripple-carry/carry-lookahead design. Other possibilities: hybrid carry-select/ripple-carry hybrid ripple-carry/carry-select Apr. 2012 Computer Arithmetic, Addition/Subtraction

166 Optimizations in Fast Adders
What looks best at the block diagram or gate level may not be best when a circuit-level design is generated (effects of wire length, signal loading, etc.) Modern practice: Optimization at the transistor level Variable-block carry-lookahead adder Optimizations for average or peak power consumption Timing-based optimizations (next slide) Apr. 2012 Computer Arithmetic, Addition/Subtraction

167 Optimizations Based on Signal Timing
So far, we have assumed that all input bits are presented at the same time and all output bits are also needed simultaneously Fig Example arrival times for operand bits in the final fast adder of a tree multiplier [Oklo96]. Apr. 2012 Computer Arithmetic, Addition/Subtraction

168 Modern Low-Power Adders Implemented in CMOS
64-Bit Adder Designs Cond’l-Sum Ling Three-Stage Ling Zeydel, Kluter, Oklobdzija, ARITH-17, 2005 Apr. 2012 Computer Arithmetic, Addition/Subtraction

169 Taxonomy of Parallel Prefix Networks
Fanout = 2^f + 1 Logic levels = log2 k + l Wire tracks = 2^t From: Harris, David, 2003 Apr. 2012 Computer Arithmetic, Addition/Subtraction

170 7.6 Modular Two-Operand Adders
mod-2^k: Ignore carry out of position k – 1 mod-(2^k – 1): Use end-around carry because 2^k = (2^k – 1) + 1 mod-(2^k + 1): Residue representation needs k + 1 bits; the diminished-1 encoding represents a number x by x – 1 x + y ≥ 2^k + 1 iff (x–1) + (y–1) + 1 ≥ 2^k (x + y) – 1 = (x – 1) + (y – 1) + 1 xy – 1 = (x–1)(y–1) + (x–1) + (y–1) Apr. 2012 Computer Arithmetic, Addition/Subtraction

171 General Modular Adders
(x + y) mod m = if x + y ≥ m then x + y – m else x + y The two candidates x + y and x + y – m are computed in parallel (a carry-save adder merges x, y, and –m ahead of one of the adders); the sign bit of x + y – m selects the correct result through a mux Fig Fast modular addition. Apr. 2012 Computer Arithmetic, Addition/Subtraction

172 8 Multioperand Addition
Chapter Goals Learn methods for speeding up the addition of several numbers (needed for multiplication or inner-product) Chapter Highlights Running total kept in redundant form Current total + Next number  New total Deferred carry assimilation Wallace/Dadda trees, parallel counters Modular multioperand addition Apr. 2012 Computer Arithmetic, Addition/Subtraction

173 Multioperand Addition: Topics
Topics in This Chapter 8.1 Using Two-Operand Adders 8.2 Carry-Save Adders 8.3 Wallace and Dadda Trees 8.4 Parallel Counters and Compressors 8.5 Adding Multiple Signed Numbers 8.6 Modular Multioperand Adders Apr. 2012 Computer Arithmetic, Addition/Subtraction

174 8.1 Using Two-Operand Adders
Some applications of multioperand addition Fig Multioperand addition problems for multiplication or inner-product computation in dot notation. Apr. 2012 Computer Arithmetic, Addition/Subtraction

175 Serial Implementation with One Adder
Fig Serial implementation of multioperand addition with a single 2-operand adder. Tserial-multi-add = O(n log(k + log n)) = O(n log k + n log log n) Therefore, addition time grows superlinearly with n when k is fixed and logarithmically with k for a given n Apr. 2012 Computer Arithmetic, Addition/Subtraction

176 Pipelined Implementation for Higher Throughput
Problem to think about: Ignoring start-up and other overheads, this scheme achieves a speedup of 4 with 3 adders. How is this possible? Fig Serial multioperand addition when each adder is a 4-stage pipeline. Apr. 2012 Computer Arithmetic, Addition/Subtraction

177 Parallel Implementation as Tree of Adders
log2n adder levels n – 1 adders Fig Adding 7 numbers in a binary tree of adders. Ttree-fast-multi-add = O(log k + log(k + 1) log(k + log2n – 1)) = O(log n log k + log n log log n) Ttree-ripple-multi-add = O(k + log n) [Justified on the next slide] Apr. 2012 Computer Arithmetic, Addition/Subtraction

178 Elaboration on Tree of Ripple-Carry Adders
Fig Ripple-carry adders at levels i and i + 1 in the tree of adders used for multi-operand addition. Ttree-ripple-multi-add = O(k + log n) The absolute best latency that we can hope for is O(log k + log n) There are kn data bits to process and using any set of computation elements with constant fan-in, this requires O(log(kn)) time We will see shortly that carry-save adders achieve this optimum time Apr. 2012 Computer Arithmetic, Addition/Subtraction

179 Computer Arithmetic, Addition/Subtraction
8.2 Carry-Save Adders Fig A ripple-carry adder turns into a carry-save adder if the carries are saved (stored) rather than propagated. Fig Carry-propagate adder (CPA) and carry-save adder (CSA) functions in dot notation. Fig Specifying full- and half-adder blocks, with their inputs and outputs, in dot notation. Apr. 2012 Computer Arithmetic, Addition/Subtraction

180 Multioperand Addition Using Carry-Save Adders
Tcarry-save-multi-add = O(tree height + TCPA) = O(log n + log k) Ccarry-save-multi-add = (n – 2)CCSA + CCPA Fig Serial carry-save addition using a single CSA. Carry-propagate adder Fig Tree of carry-save adders reducing seven numbers to two. Apr. 2012 Computer Arithmetic, Addition/Subtraction

181 Example Reduction by a CSA Tree
Fig Representing a seven-operand addition in tabular form: each bit-position column tracks how many dots remain as the FAs (plus 1 HA and a short carry-propagate adder) reduce the matrix level by level A full-adder compacts 3 dots into 2 (compression ratio of 1.5) A half-adder rearranges 2 dots (no compression, but still useful) Fig Addition of seven 6-bit numbers in dot notation. Apr. 2012 Computer Arithmetic, Addition/Subtraction

182 Width of Adders in a CSA Tree
Fig Adding seven k-bit numbers and the CSA/CPA widths required. Due to the gradual retirement (dropping out) of some of the result bits, CSA widths do not vary much as we go down the tree levels Apr. 2012 Computer Arithmetic, Addition/Subtraction

183 8.3 Wallace and Dadda Trees
Table 8.1 The maximum number n(h) of inputs for an h-level CSA tree n(h): Maximum number of inputs for h levels Recurrences: h(n) = 1 + h(⌈2n/3⌉) n(h) = ⌊3n(h – 1)/2⌋ Bounds: 2 × 1.5^(h–1) < n(h) ≤ 2 × 1.5^h Apr. 2012 Computer Arithmetic, Addition/Subtraction
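The recurrence is easy to tabulate (sketch; the helper name is mine):

```python
def n_of_h(h):
    """Max number of inputs reducible to 2 by an h-level CSA tree."""
    n = 3                      # one CSA level turns 3 numbers into 2
    for _ in range(h - 1):
        n = 3 * n // 2         # n(h) = floor(3 n(h-1) / 2)
    return n

print([(h, n_of_h(h)) for h in range(1, 7)])
# [(1, 3), (2, 4), (3, 6), (4, 9), (5, 13), (6, 19)]
```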

184 Example Wallace and Dadda Reduction Trees
Wallace tree: Reduce the number of operands at the earliest possible opportunity Dadda tree: Postpone the reduction to the extent possible without causing added delay
h n(h)
2 4
3 6
4 9
5 13
6 19
Fig Adding seven 6-bit numbers using Dadda’s strategy. Fig Addition of seven 6-bit numbers in dot notation. Apr. 2012 Computer Arithmetic, Addition/Subtraction

185 A Small Optimization in Reduction Trees
Fig Adding seven 6-bit numbers by taking advantage of the final adder’s carry-in. Fig Adding seven 6-bit numbers using Dadda’s strategy. Apr. 2012 Computer Arithmetic, Addition/Subtraction

186 8.4 Parallel Counters and Compressors
Fig A 10-input parallel counter also known as a (10; 4)-counter. 1-bit full-adder = (3; 2)-counter Circuit reducing 7 bits to their 3-bit sum = (7; 3)-counter Circuit reducing n bits to their log2(n + 1)-bit sum = (n; log2(n + 1))-counter Apr. 2012 Computer Arithmetic, Addition/Subtraction

187 Accumulative Parallel Counters
True generalization of sequential counters q-bit initial count x in a count register; n increment signals vi, with 2^(q–1) < n ≤ 2^q A parallel incrementer forms the q-bit final count y = x + Σvi (the overflow carry cq is ignored, or used for decision) Possible application: Compare Hamming weight of a vector to a constant Apr. 2012 Computer Arithmetic, Addition/Subtraction

188 Up/Down Parallel Counters
Generalization of up/down counters Possible application: Compare Hamming weights of two input vectors Apr. 2012 Computer Arithmetic, Addition/Subtraction

189 8.5 Generalized Parallel Counters
Fig Dot notation for a (5, 5; 4)-counter and the use of such counters for reducing five numbers to two numbers. Multicolumn reduction (5, 5; 4)-counter Unequal columns Gen. parallel counter = Parallel compressor (2, 3; 3)-counter Apr. 2012 Computer Arithmetic, Addition/Subtraction

190 A General Strategy for Column Compression
(n; 2)-counters Fig Schematic diagram of an (n; 2)-counter built of identical circuit slices n + y1 + y2 + y3 ≤ 3 + 2y1 + 4y2 + 8y3, i.e., n – 3 ≤ y1 + 3y2 + 7y3 Example: Design a bit-slice of an (11; 2)-counter Solution: Let’s limit transfers to two stages. Then, 8 ≤ y1 + 3y2 Possible choices include y1 = 5, y2 = 1 or y1 = y2 = 2 Apr. 2012 Computer Arithmetic, Addition/Subtraction

191 (4; 2)-Counters and Multicolumn 4-to-2 Reduction
A (4; 2)-counter compresses 4 dots plus an incoming transfer into sum and carry outputs plus an outgoing transfer: [0, 5] = {0, 1} + {0, 2} + {0, 2} We will discuss (4; 2)-counters in greater detail in Section 11.2 (see, e.g., Fig for an efficient realization) Apr. 2012 Computer Arithmetic, Addition/Subtraction

192 8.5 Adding Multiple Signed Numbers
(a) Using sign extension: each sign bit xk–1, yk–1, zk–1 is repeated across the extended positions to the left of the magnitude positions (b) Using negatively weighted bits: complemented sign bits xk–1', yk–1', zk–1' replace the originals, justified by –b = (1 – b) + 1 – 2 Fig Adding three 2's-complement numbers. Apr. 2012 Computer Arithmetic, Addition/Subtraction

193 8.6 Modular Multioperand Adders
(a) m = 2^k: drop the outgoing carries (b) m = 2^k – 1: end-around carries (c) m = 2^k + 1: invert the end-around carries Fig Modular carry-save addition with special moduli. Apr. 2012 Computer Arithmetic, Addition/Subtraction

194 Modular Reduction with Pseudoresidues
Six inputs in the range [0, 20] Fig Modulo-21 reduction of 6 numbers taking advantage of the fact that 64 = 1 mod 21 and using 6-bit pseudoresidues. Pseudoresidues in the range [0, 63] Add with end-around carry Final pseudoresidue (to be reduced) Apr. 2012 Computer Arithmetic, Addition/Subtraction

195 Part III Multiplication
28. Reconfigurable Arithmetic Appendix: Past, Present, and Future Apr. 2012 Computer Arithmetic, Multiplication

196 About This Presentation
This presentation is intended to support the use of the textbook Computer Arithmetic: Algorithms and Hardware Designs (Oxford U. Press, 2nd ed., 2010, ISBN ). It is updated regularly by the author as part of his teaching of the graduate course ECE 252B, Computer Arithmetic, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Unauthorized uses are strictly prohibited. © Behrooz Parhami Edition Released Revised First Jan. 2000 Sep. 2001 Sep. 2003 Oct. 2005 May 2007 Apr. 2008 Apr. 2009 Second Apr. 2010 Apr. 2011 Apr. 2012 Apr. 2012 Computer Arithmetic, Multiplication

197 Computer Arithmetic, Multiplication
III Multiplication Review multiplication schemes and various speedup methods Multiplication is heavily used (in arithmetic & array indexing) Division = reciprocation + multiplication Multiplication speedup: high-radix, tree, recursive Bit-serial, modular, and array multipliers Topics in This Part Chapter 9 Basic Multiplication Schemes Chapter 10 High-Radix Multipliers Chapter 11 Tree and Array Multipliers Chapter 12 Variations in Multipliers Apr. 2012 Computer Arithmetic, Multiplication

198 Computer Arithmetic, Multiplication
“Well, well, for a rabbit, you’re not very good at multiplying, are you?” Apr. 2012 Computer Arithmetic, Multiplication

199 9 Basic Multiplication Schemes
Chapter Goals Study shift/add or bit-at-a-time multipliers and set the stage for faster methods and variations to be covered in Chapters 10-12 Chapter Highlights Multiplication = multioperand addition Hardware, firmware, software algorithms Multiplying 2’s-complement numbers The special case of one constant operand Apr. 2012 Computer Arithmetic, Multiplication

200 Basic Multiplication Schemes: Topics
Topics in This Chapter 9.1 Shift/Add Multiplication Algorithms 9.2 Programmed Multiplication 9.3 Basic Hardware Multipliers 9.4 Multiplication of Signed Numbers 9.5 Multiplication by Constants 9.6 Preview of Fast Multipliers Apr. 2012 Computer Arithmetic, Multiplication

201 9.1 Shift/Add Multiplication Algorithms
Notation for our discussion of multiplication algorithms: a Multiplicand ak–1 ak–2 ... a1 a0 x Multiplier xk–1 xk–2 ... x1 x0 p Product (a × x) p2k–1 p2k–2 ... p3 p2 p1 p0 Initially, we assume unsigned operands Fig Multiplication of two 4-bit unsigned binary numbers in dot notation. Apr. 2012 Computer Arithmetic, Multiplication

202 Multiplication Recurrence
Multiplication with right shifts (preferred): top-to-bottom accumulation p(j+1) = (p(j) + xj a 2^k) 2^–1 with p(0) = 0 and p(k) = p = ax + p(0) 2^–k (add, then shift right) Multiplication with left shifts: bottom-to-top accumulation p(j+1) = 2 p(j) + xk–j–1 a with p(0) = 0 and p(k) = p = ax + p(0) 2^k (shift, then add) Fig. 9.1 Apr. 2012 Computer Arithmetic, Multiplication
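The right-shift recurrence translates directly into code; a sketch (mine) for unsigned operands. Every intermediate p(j) is an integer because each of its terms carries a factor 2^(k–j), so the shift never discards a 1:

```python
def multiply_right_shifts(a, x, k):
    """Shift/add multiplication of k-bit unsigned a and x."""
    p = 0
    for j in range(k):
        x_j = (x >> j) & 1
        p = (p + x_j * (a << k)) >> 1   # add x_j * a * 2^k, then shift right
    return p                             # p(k) = a * x

print(multiply_right_shifts(10, 11, 4))  # 110
```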

203 Examples of Basic Multiplication
Right-shift algorithm (add xj a, then halve, steps p(0) through p(4)) and left-shift algorithm (double, then add xk–j–1 a) worked side by side; the step-by-step binary values appear in the figure Fig Examples of sequential multiplication with right and left shifts. p(j+1) = (p(j) + xj a 2^k) 2^–1 (add, then shift right) Check: 10 × 11 = 110 Apr. 2012 Computer Arithmetic, Multiplication

204 Examples of Basic Multiplication (Continued)
Right-shift algorithm and left-shift algorithm worked side by side (continued); the step-by-step binary values appear in the figure Fig Examples of sequential multiplication with right and left shifts. p(j+1) = 2 p(j) + xk–j–1 a (shift, then add) Check: 10 × 11 = 110 Apr. 2012 Computer Arithmetic, Multiplication

205 9.2 Programmed Multiplication
{Using right shifts, multiply unsigned m_cand and m_ier, storing the resultant 2k-bit product in p_high and p_low. Registers: R0 holds 0 Rc for counter Ra for m_cand Rx for m_ier Rp for p_high Rq for p_low} {Load operands into registers Ra and Rx} mult: load Ra with m_cand load Rx with m_ier {Initialize partial product and counter} copy R0 into Rp copy R0 into Rq load k into Rc {Begin multiplication loop} m_loop: shift Rx right 1 {LSB moves to carry flag} branch no_add if carry = 0 add Ra to Rp {carry flag is set to cout} no_add: rotate Rp right 1 {carry to MSB, LSB to carry} rotate Rq right 1 {carry to MSB, LSB to carry} decr Rc {decrement counter by 1} branch m_loop if Rc ≠ 0 {Store the product} store Rp into p_high store Rq into p_low m_done: ... Fig Programmed multiplication (right-shift algorithm). Apr. 2012 Computer Arithmetic, Multiplication

206 Time Complexity of Programmed Multiplication
Assume k-bit words k iterations of the main loop 6-7 instructions per iteration, depending on the multiplier bit Thus, 6k + 3 to 7k + 3 machine instructions, ignoring operand loads and result store k = 32 implies 200+ instructions on average This is too slow for many modern applications! Microprogrammed multiply would be somewhat better Apr. 2012 Computer Arithmetic, Multiplication

207 9.3 Basic Hardware Multipliers
p(j+1) = (p(j) + xj a 2k) 2–1 |–––add–––| |––shift right––| Fig Hardware realization of the sequential multiplication algorithm with additions and right shifts. Apr. 2012 Computer Arithmetic, Multiplication

208 Example of Hardware Multiplication
Example: multiplicand (10)ten, multiplier (11)ten, product (110)ten, traced on the hardware below p(j+1) = (p(j) + xj a 2^k) 2^–1 (add, then shift right) Fig. 9.4a Hardware realization of the sequential multiplication algorithm with additions and right shifts. Apr. 2012 Computer Arithmetic, Multiplication

209 Performing Add and Shift in One Clock Cycle
Fig Combining the loading and shifting of the double-width register holding the partial product and the partially used multiplier. Apr. 2012 Computer Arithmetic, Multiplication

210 Sequential Multiplication with Left Shifts
Fig. 9.4b Hardware realization of the sequential multiplication algorithm with left shifts and additions. Apr. 2012 Computer Arithmetic, Multiplication

211 9.4 Multiplication of Signed Numbers
============================ a x p(0) +x0a ––––––––––––––––––––––––––––– 2p(1) p(1) +x1a 2p(2) p(2) +x2a 2p(3) p(3) +x3a 2p(4) p(4) +x4a 2p(5) p(5) Check: –10 × 11 = –110 = –512 + 256 + 128 + 16 + 2 Fig Sequential multiplication of 2’s-complement numbers with right shifts (positive multiplier). Negative multiplicand, positive multiplier: No change, other than looking out for proper sign extension Apr. 2012 Computer Arithmetic, Multiplication

212 The Case of a Negative Multiplier
============================ a x p(0) +x0a ––––––––––––––––––––––––––––– 2p(1) p(1) +x1a 2p(2) p(2) +x2a 2p(3) p(3) +x3a 2p(4) p(4) +(–x4a) 2p(5) p(5) Check: –10 × (–11) = 110 Fig Sequential multiplication of 2’s-complement numbers with right shifts (negative multiplier). Negative multiplicand, negative multiplier: In last step (the sign bit), subtract rather than add Apr. 2012 Computer Arithmetic, Multiplication

213 Signed 2’s-Complement Hardware Multiplier
Fig The 2’s-complement sequential hardware multiplier: a (k + 1)-bit adder whose mux (enable/select controls) supplies the multiple; the subtract control is 0, except in the last cycle Apr. 2012 Computer Arithmetic, Multiplication

214 Computer Arithmetic, Multiplication
Booth’s Recoding Table Radix-2 Booth’s recoding
xi xi–1 yi Explanation
0 0 0 No string of 1s in sight
0 1 1 End of string of 1s in x
1 0 –1 Beginning of string of 1s in x
1 1 0 Continuation of string of 1s in x
Example: an operand x and its recoded version y (see figure) Justification: 2^j + 2^(j–1) + ... + 2^(i+1) + 2^i = 2^(j+1) – 2^i Apr. 2012 Computer Arithmetic, Multiplication
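In code form, the table is just yi = xi–1 – xi; a sketch (mine) for a k-bit 2's-complement operand:

```python
def booth_recode(x, k):
    """Digits y_0 .. y_(k-1), LSB first, each in {-1, 0, 1}."""
    digits, x_prev = [], 0               # x_(-1) = 0
    for i in range(k):
        x_i = (x >> i) & 1
        digits.append(x_prev - x_i)      # 00 -> 0, 01 -> 1, 10 -> -1, 11 -> 0
        x_prev = x_i
    return digits

d = booth_recode(0b1010, 4)              # x = -6 in 4-bit 2's complement
print(d, sum(y << i for i, y in enumerate(d)))   # [0, -1, 1, -1] -6
```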

215 Example Multiplication with Booth’s Recoding
============================ a x Multiplier y Booth-recoded p(0) +y0a ––––––––––––––––––––––––––––– 2p(1) p(1) +y1a 2p(2) p(2) +y2a 2p(3) p(3) +y3a 2p(4) p(4) +y4a 2p(5) p(5) Check: –10 × (–11) = 110 Fig Sequential multiplication of 2’s-complement numbers with right shifts by means of Booth’s recoding (recoding rule: xi xi–1 → yi). Apr. 2012 Computer Arithmetic, Multiplication

216 9.5 Multiplication by Constants
Explicit, e.g., y := 12 * x + 1 Implicit, e.g., A[i, j] := A[i, j] + B[i, j], where the address of element A[i, j] of an m × n array (row i, column j) is base + n * i + j Software aspects: Optimizing compilers replace multiplications by shifts/adds/subs Produce efficient code using as few registers as possible Find the best code by a time/space-efficient algorithm Hardware aspects: Synthesize special-purpose units such as filters y[t] = a0 x[t] + a1 x[t – 1] + a2 x[t – 2] + b1 y[t – 1] + b2 y[t – 2] Apr. 2012 Computer Arithmetic, Multiplication

217 Multiplication Using Binary Expansion
Example: Multiply R1 by the constant 113 = (1110001)two R2 ← R1 shift-left 1 R3 ← R2 + R1 R6 ← R3 shift-left 1 R7 ← R6 + R1 R112 ← R7 shift-left 4 R113 ← R112 + R1 Ri: Register that contains i times (R1) This notation is for clarity; only one register other than R1 is needed Shorter sequence using shift-and-add instructions: R3 ← R1 shift-left 1 + R1 R7 ← R3 shift-left 1 + R1 R113 ← R7 shift-left 4 + R1 Apr. 2012 Computer Arithmetic, Multiplication

218 Multiplication via Recoding
Example: Multiply R1 by 113 = (1110001)two = (1 0 0 –1 0 0 0 1)two R8 ← R1 shift-left 3 R7 ← R8 – R1 R112 ← R7 shift-left 4 R113 ← R112 + R1 Shorter sequence using shift-and-add/subtract instructions: R7 ← R1 shift-left 3 – R1 R113 ← R7 shift-left 4 + R1 6 shift or add (3 shift-and-add) instructions needed without recoding The canonic signed-digit representation of a number contains no consecutive nonzero digits: average number of shift-adds is O(k/3) Apr. 2012 Computer Arithmetic, Multiplication
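Canonic signed-digit recoding can be computed with a simple loop (the algorithm is standard; the code is my sketch):

```python
def csd(x):
    """CSD digits of x >= 0, LSB first, with no two adjacent nonzeros."""
    digits = []
    while x:
        d = 2 - (x % 4) if x & 1 else 0   # +1 if x = 1 (mod 4), -1 if x = 3 (mod 4)
        digits.append(d)
        x = (x - d) >> 1
    return digits

print(csd(113))    # [1, 0, 0, 0, -1, 0, 0, 1], i.e., 113 = 128 - 16 + 1
```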

219 Multiplication via Factorization
Example: Multiply R1 by 119 = 7 × 17 = (8 – 1) × (16 + 1) R8 ← R1 shift-left 3 R7 ← R8 – R1 R112 ← R7 shift-left 4 R119 ← R112 + R7 Shorter sequence using shift-and-add/subtract instructions: R7 ← R1 shift-left 3 – R1 R119 ← R7 shift-left 4 + R7 Requires a scratch register for holding the multiple 7 119 = (1110111)two = (1 0 0 0 –1 0 0 –1)two More instructions may be needed without factorization Apr. 2012 Computer Arithmetic, Multiplication

220 Multiplication by Multiple Constants
Example: Multiplying a number by 45, 49, and 65 R9 ← R1 shift-left 3 + R1 R45 ← R9 shift-left 2 + R9 R7 ← R1 shift-left 3 – R1 R49 ← R7 shift-left 3 – R7 R65 ← R1 shift-left 6 + R1 Separate solutions: 5 shift-add/subtract operations A combined solution for all three constants: R65 ← R1 shift-left 6 + R1 R49 ← R65 – R1 left-shift 4 R45 ← R49 – R1 left-shift 2 A programmable block can perform any of the three multiplications Apr. 2012 Computer Arithmetic, Multiplication

221 9.6 Preview of Fast Multipliers
Viewing multiplication as a multioperand addition problem, there are but two ways to speed it up a. Reducing the number of operands to be added: Handling more than one multiplier bit at a time (high-radix multipliers, Chapter 10) b. Adding the operands faster: Parallel/pipelined multioperand addition (tree and array multipliers, Chapter 11) In Chapter 12, we cover all remaining multiplication topics: Bit-serial multipliers Modular multipliers Multiply-add units Squaring as a special case Apr. 2012 Computer Arithmetic, Multiplication

222 10 High-Radix Multipliers
Chapter Goals Study techniques that allow us to handle more than one multiplier bit in each cycle (two bits in radix 4, three in radix 8, . . .) Chapter Highlights High radix gives rise to “difficult” multiples Recoding (change of digit-set) as remedy Carry-save addition reduces cycle time Implementation and optimization methods Apr. 2012 Computer Arithmetic, Multiplication

223 High-Radix Multipliers: Topics
Topics in This Chapter 10.1 Radix-4 Multiplication 10.2 Modified Booth’s Recoding 10.3 Using Carry-Save Adders 10.4 Radix-8 and Radix-16 Multipliers 10.5 Multibeat Multipliers 10.6 VLSI Complexity Issues Apr. 2012 Computer Arithmetic, Multiplication

224 10.1 Radix-4 Multiplication
Partial products x0 a r^0, x1 a r^1, x2 a r^2, x3 a r^3 (Fig. 9.1, modified) Multiplication with right shifts in radix r (preferred): top-to-bottom accumulation p(j+1) = (p(j) + xj a r^k) r^–1 with p(0) = 0 and p(k) = p = ax + p(0) r^–k (add, then shift right) Multiplication with left shifts in radix r: bottom-to-top accumulation p(j+1) = r p(j) + xk–j–1 a with p(0) = 0 and p(k) = p = ax + p(0) r^k (shift, then add) Apr. 2012 Computer Arithmetic, Multiplication

225 Radix-4 Multiplication in Dot Notation
Fig Radix-4, or two-bit-at-a-time, multiplication in dot notation Fig. 9.1 Number of cycles is halved, but now the “difficult” multiple 3a must be dealt with Apr. 2012 Computer Arithmetic, Multiplication

226 A Possible Design for a Radix-4 Multiplier
Precomputed via shift-and-add (3a = 2a + a) k/2 + 1 cycles, rather than k One extra cycle over k/2 not too bad, but we would like to avoid it if possible Solving this problem for radix 4 may also help when dealing with even higher radices Fig The multiple generation part of a radix-4 multiplier with precomputation of 3a. Apr. 2012 Computer Arithmetic, Multiplication

227 Example Radix-4 Multiplication Using 3a
================================ a 3a x p(0) +(x1x0)twoa ––––––––––––––––––––––––––––––––– 4p(1) p(1) +(x3x2)twoa 4p(2) p(2) Fig Example of radix-4 multiplication using the 3a multiple. Apr. 2012 Computer Arithmetic, Multiplication

228 A Second Design for a Radix-4 Multiplier
Fig The multiple generation part of a radix-4 multiplier based on replacing 3a with 4a (carry into next higher radix-4 multiplier digit) and –a. The mux control and set-carry signals are logic functions of xi+1, xi, and the incoming carry c (see figure). Apr. 2012 Computer Arithmetic, Multiplication

229 10.2 Modified Booth’s Recoding
Table Radix-4 Booth’s recoding yielding (zk/2 ... z1 z0)four
xi+1 xi xi–1 yi+1 yi zi/2 Explanation
0 0 0 0 0 0 No string of 1s in sight
0 0 1 0 1 1 End of string of 1s
0 1 0 1 –1 1 Isolated 1
0 1 1 1 0 2 End of string of 1s
1 0 0 –1 0 –2 Beginning of string of 1s
1 0 1 –1 1 –1 End a string, begin new one
1 1 0 0 –1 –1 Beginning of string of 1s
1 1 1 0 0 0 Continuation of string of 1s
Example: an operand x, its recoded version y, and the radix-4 version z (see figure) Apr. 2012 Computer Arithmetic, Multiplication
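Equivalently, each radix-4 digit is z = –2 xi+1 + xi + xi–1; a checked sketch (mine, assuming even k):

```python
def booth4_digits(x, k):
    """Radix-4 digits of k-bit 2's-complement x (k even), LSB first."""
    signed_x = x - (((x >> (k - 1)) & 1) << k)   # value of x as a signed int
    digits, x_prev = [], 0
    for i in range(0, k, 2):
        hi, lo = (x >> (i + 1)) & 1, (x >> i) & 1
        digits.append(-2 * hi + lo + x_prev)     # z = -2 x_(i+1) + x_i + x_(i-1)
        x_prev = hi
    assert sum(z << i for i, z in zip(range(0, k, 2), digits)) == signed_x
    return digits

print(booth4_digits(0b1010, 4))   # [-2, -1]: -6 = (-1)*4 + (-2)*1
```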

230 Example Multiplication via Modified Booth’s Recoding
================================ a x z Radix-4 p(0) +z0a ––––––––––––––––––––––––––––––––– 4p(1) p(1) +z1a 4p(2) p(2) Fig Example of radix-4 multiplication with modified Booth’s recoding of the 2’s-complement multiplier. Apr. 2012 Computer Arithmetic, Multiplication

231 Multiple Generation with Radix-4 Booth’s Recoding
Sign of a ---- Encoding ---- Digit neg two non0 Could have named this signal one/two Fig The multiple generation part of a radix-4 multiplier based on Booth’s recoding. Apr. 2012 Computer Arithmetic, Multiplication

232 10.3 Using Carry-Save Adders
Fig Radix-4 multiplication with a carry-save adder used to combine the cumulative partial product, xia, and 2xi+1a into two numbers. Apr. 2012 Computer Arithmetic, Multiplication

233 Keeping the Partial Product in Carry-Save Form
Upper half of PP Lower half of PP Right shift Sum Carry (a) Multiplier block diagram (b) Operation in a typical cycle Fig Radix-2 multiplication with the upper half of the cumulative partial product kept in stored-carry form. Apr. 2012 Computer Arithmetic, Multiplication

234 Carry-Save Multiplier with Radix-4 Booth’s Recoding
Fig Radix-4 multiplication with a CSA used to combine the stored-carry cumulative partial product and zi/2a into two numbers. Apr. 2012 Computer Arithmetic, Multiplication

235 Radix-4 Booth’s Recoding for Parallel Multiplication
Fig Booth recoding and multiple selection logic for high-radix or parallel multiplication. Apr. 2012 Computer Arithmetic, Multiplication

236 Yet Another Design for Radix-4 Multiplication
Fig Radix-4 multiplication, with the cumulative partial product, xia, and 2xi+1a combined into two numbers by two CSAs. (4; 2)-counter Apr. 2012 Computer Arithmetic, Multiplication

237 10.4 Radix-8 and Radix-16 Multipliers
4-bit right shift Fig Radix-16 multiplication with the upper half of the cumulative partial product in carry-save form. Apr. 2012 Computer Arithmetic, Multiplication

238 Other High-Radix Multipliers
Remove this mux & CSA and replace the 4-bit shift (adder) with a 3-bit shift (adder) to get a radix-8 multiplier (cycle time will remain the same, though) A radix-16 multiplier design becomes a radix-256 multiplier if radix-4 Booth’s recoding is applied first (the muxes are replaced by Booth recoding and multiple selection logic) Fig Apr. 2012 Computer Arithmetic, Multiplication

239 A Spectrum of Multiplier Design Choices
Fig High-radix multipliers as intermediate between sequential radix-2 and full-tree multipliers. Apr. 2012 Computer Arithmetic, Multiplication

240 10.5 Multibeat Multipliers
Fig Two-phase clocking for sequential logic. Begin changing FF contents Change becomes visible at FF output Observation: Half of the clock cycle goes to waste One cycle Apr. 2012 Computer Arithmetic, Multiplication

241 Twin-Beat and Three-Beat Multipliers
This radix-64 multiplier runs at the clock rate of a radix-8 design (2X speed) Fig Conceptual view of a three-beat multiplier. Fig Twin-beat multiplier with radix-8 Booth’s recoding. Apr. 2012 Computer Arithmetic, Multiplication

242 10.6 VLSI Complexity Issues
A radix-2^b multiplier requires: bk two-input AND gates to form the partial-products bit matrix O(bk) area for the CSA tree At least Θ(k) area for the final carry-propagate adder Total area: A = O(bk) Latency: T = O((k/b) log b + log k) Any VLSI circuit computing the product of two k-bit integers must satisfy the following constraints: AT grows at least as fast as k^(3/2) AT² is at least proportional to k² The preceding radix-2^b implementations are suboptimal, because: AT = O(k² log b + bk log k) AT² = O((k³/b) log² b) Apr. 2012 Computer Arithmetic, Multiplication

243 Comparing High- and Low-Radix Multipliers
AT = O(k² log b + bk log k) AT² = O((k³/b) log² b)
Measure Low-cost, b = O(1) High-speed, b = O(k) AT- or AT²-optimal
AT O(k²) O(k² log k) O(k^(3/2))
AT² O(k³) O(k² log² k) O(k²)
Intermediate designs do not yield better AT or AT² values; the multipliers remain asymptotically suboptimal for any b By the AT measure (indicator of cost-effectiveness), slower radix-2 multipliers are better than high-radix or tree multipliers Thus, when an application requires many independent multiplications, it is more cost-effective to use a large number of slower multipliers High-radix multiplier latency can be reduced from O((k/b) log b + log k) to O(k/b + log k) through more effective pipelining (Chapter 11) Apr. 2012 Computer Arithmetic, Multiplication

244 11 Tree and Array Multipliers
Chapter Goals Study the design of multipliers for highest possible performance (speed, throughput) Chapter Highlights Tree multiplier = reduction tree + redundant-to-binary converter Avoiding full sign extension in multiplying signed numbers Array multiplier = one-sided reduction tree + ripple-carry adder Apr. 2012 Computer Arithmetic, Multiplication

245 Tree and Array Multipliers: Topics
Topics in This Chapter 11.1 Full-Tree Multipliers 11.2 Alternative Reduction Trees 11.3 Tree Multipliers for Signed Numbers 11.4 Partial-Tree and Truncated Multipliers 11.5 Array Multipliers 11.6 Pipelined Tree and Array Multipliers Apr. 2012 Computer Arithmetic, Multiplication

246 11.1 Full-Tree Multipliers
Fig General structure of a full-tree multiplier. Fig High-radix multipliers as intermediate between sequential radix-2 and full-tree multipliers. Apr. 2012 Computer Arithmetic, Multiplication

247 Full-Tree versus Partial-Tree Multiplier
Schematic diagrams for full-tree and partial-tree multipliers. Apr. 2012 Computer Arithmetic, Multiplication

248 Variations in Full-Tree Multiplier Design
Designs are distinguished by variations in three elements: 1. Multiple-forming circuits 2. Partial-products reduction tree 3. Redundant-to-binary converter Fig. 11.1 Apr. 2012 Computer Arithmetic, Multiplication

249 Example of Variations in CSA Tree Design
Fig Two different binary 4 × 4 tree multipliers (corrections shown in red). Apr. 2012 Computer Arithmetic, Multiplication

250 Computer Arithmetic, Multiplication
Details of a CSA Tree Fig Possible CSA tree for a 7  7 tree multiplier. CSA trees are quite irregular, causing some difficulties in VLSI realization Thus, our motivation to examine alternate methods for partial products reduction Apr. 2012 Computer Arithmetic, Multiplication

251 11.2 Alternative Reduction Trees
Dot-count balance for the 11-input slice: 11 + y1 = 2y1 + 3, therefore y1 = 8 carries are needed Fig. 11.4 A slice of a balanced-delay tree for 11 inputs. Apr. 2012 Computer Arithmetic, Multiplication

252 Binary Tree of 4-to-2 Reduction Modules
(a) Binary tree of (4; 2)-counters (b) Realization with FAs (c) A faster realization of the 4-to-2 compressor Fig Tree multiplier with a more regular structure based on 4-to-2 reduction modules. Due to its recursive structure, a binary tree is more regular than a 3-to-2 reduction tree when laid out in VLSI Apr. 2012 Computer Arithmetic, Multiplication

253 Example Multiplier with 4-to-2 Reduction Tree
Even if 4-to-2 reduction is implemented using two CSA levels, design regularity potentially makes up for the larger number of logic levels Similarly, using Booth’s recoding may not yield any advantage, because it introduces irregularity Fig Layout of a partial-products reduction tree composed of 4-to-2 reduction modules. Each solid arrow represents two numbers. Apr. 2012 Computer Arithmetic, Multiplication

254 11.3 Tree Multipliers for Signed Numbers
From Fig. 8.19a: sign extension in multioperand addition, with each sign bit repeated across the extended positions. The difference in multiplication is the shifting of the sign positions Fig Sharing of full adders to reduce the CSA width in a signed tree multiplier. Apr. 2012 Computer Arithmetic, Multiplication

255 Using the Negative-Weight Property of the Sign Bit
Sign extension is a way of converting negatively weighted bits (negabits) to positively weighted bits (posibits) to facilitate reduction, but there are other methods of accomplishing the same without introducing a lot of extra bits Baugh and Wooley have contributed two such methods Fig Baugh-Wooley 2’s-complement multiplication. Apr. 2012 Computer Arithmetic, Multiplication

256 The Baugh-Wooley Method and Its Modified Form
Fig. 11.8 Baugh-Wooley: –a4 x0 = a4(1 – x0) – a4 = a4 x0' – a4, i.e., the posibit a4 x0' in this column and –a4 in the next column Modified Baugh-Wooley: –a4 x0 = (1 – a4 x0) – 1 = (a4 x0)' – 1, i.e., the complemented bit (a4 x0)' in this column and –1 in the next column Apr. 2012 Computer Arithmetic, Multiplication

257 Alternate Views of the Baugh-Wooley Methods
The negabits –a4 xj along one edge of the partial-product matrix and –ai x4 along the other can be rewritten as complemented posibits a4 xj' and ai' x4 plus constant 1s and the bits a4 and x4 in the appropriate columns, yielding the two equivalent dot-matrix views of the Baugh-Wooley methods Apr. 2012 Computer Arithmetic, Multiplication

258 11.4 Partial-Tree and Truncated Multipliers
High-radix versus partial-tree multipliers: The difference is quantitative, not qualitative For small h, say ≤ 8 bits, we view the multiplier of Fig as high-radix When h is a significant fraction of k, say k/2 or k/4, then we tend to view it as a partial-tree multiplier Better design through pipelining to be covered in Section 11.6 Fig General structure of a partial-tree multiplier. Apr. 2012 Computer Arithmetic, Multiplication

259 Why Truncated Multipliers?
Nearly half of the hardware in array/tree multipliers is there to get the last bit right (1 dot = one FPGA cell) In k-by-k fractional multiplication, the dots below the truncation line are simply dropped Max error = 8/2 + 7/4 + 6/8 + 5/16 + 4/32 + 3/64 + 2/128 + 1/256 ≈ 7.004 ulp Mean error = 1.751 ulp Fig The idea of a truncated multiplier with 8-bit fractional operands. Apr. 2012 Computer Arithmetic, Multiplication

260 Truncated Multipliers with Error Compensation
We can introduce additional “dots” on the left-hand side to compensate for the removal of dots from the right-hand side: constant compensation (a fixed dot) or variable compensation (data-dependent dots such as x–1 and y–1) Constant and variable error compensation for truncated multipliers. Constant: max error = +4 ulp / ≈ –3 ulp, mean error = ? ulp Variable: max error = +? ulp / ≈ –? ulp, mean error = ? ulp Apr. 2012 Computer Arithmetic, Multiplication

261 Computer Arithmetic, Multiplication
11.5 Array Multipliers Fig A basic array multiplier uses a one-sided CSA tree and a ripple-carry adder. Fig Details of a 5 × 5 array multiplier using FA blocks. Apr. 2012 Computer Arithmetic, Multiplication

262 Signed (2’s-complement) Array Multiplier
Fig Modifications in a 5 × 5 array multiplier to deal with 2’s-complement inputs using the Baugh-Wooley method or to shorten the critical path. Apr. 2012 Computer Arithmetic, Multiplication

263 Array Multiplier Built of Modified Full-Adder Cells
Fig Design of a 5 × 5 array multiplier with two additive inputs and full-adder blocks that include AND gates. Apr. 2012 Computer Arithmetic, Multiplication

264 Array Multiplier without a Final Carry-Propagate Adder
Fig Conceptual view of a modified array multiplier that does not need a final carry-propagate adder. Fig Carry-save addition, performed in level i, extends the conditionally computed bits of the final product. All remaining bits of the final product are produced only 2 gate levels after pk–1 Apr. 2012 Computer Arithmetic, Multiplication

265 11.6 Pipelined Tree and Array Multipliers
Fig General structure of a partial-tree multiplier. Fig Efficiently pipelined partial-tree multiplier. Apr. 2012 Computer Arithmetic, Multiplication

266 Pipelined Array Multipliers
With latches after every FA level, the maximum throughput is achieved Latches may be inserted after every h FA levels for an intermediate design Example: 3-stage pipeline Fig Pipelined 5 × 5 array multiplier using latched FA blocks. The small shaded boxes are latches. Apr. 2012 Computer Arithmetic, Multiplication

267 12 Variations in Multipliers
Chapter Goals Learn additional methods for synthesizing fast multipliers as well as other types of multipliers (bit-serial, modular, etc.) Chapter Highlights Building a multiplier from smaller units Performing multiply-add as one operation Bit-serial and (semi)systolic multipliers Using a multiplier for squaring is wasteful Apr. 2012 Computer Arithmetic, Multiplication

268 Variations in Multipliers: Topics
Topics in This Chapter 12.1 Divide-and-Conquer Designs 12.2 Additive Multiply Modules 12.3 Bit-Serial Multipliers 12.4 Modular Multipliers 12.5 The Special Case of Squaring 12.6 Combined Multiply-Add Units Apr. 2012 Computer Arithmetic, Multiplication

269 12.1 Divide-and-Conquer Designs
Building a wide multiplier from narrower ones Fig Divide-and-conquer (recursive) strategy for synthesizing a 2b × 2b multiplier from b × b multipliers. Apr. 2012 Computer Arithmetic, Multiplication

270 General Structure of a Recursive Multiplier
2b  2b use (3; 2)-counters 3b  3b use (5; 2)-counters 4b  4b use (7; 2)-counters Fig Using b  b multipliers to synthesize 2b  2b, 3b  3b, and 4b  4b multipliers. Apr. 2012 Computer Arithmetic, Multiplication

271 Using b  c, rather than b  b Building Blocks
2b  2c use b  c multipliers and (3; 2)-counters 2b  4c use b  c multipliers and (5?; 2)-counters gb  hc use b  c multipliers and (?; 2)-counters Apr. 2012 Computer Arithmetic, Multiplication

272 Wide Multiplier Built of Narrow Multipliers and Adders
Fig Using 4  4 multipliers and 4-bit adders to synthesize an 8  8 multiplier. Apr. 2012 Computer Arithmetic, Multiplication

273 Karatsuba Multiplication
2b  2b multiplication requires four b  b multiplications: (2baH + aL)  (2bxH + xL) = 22baHxH + 2b (aHxL + aLxH) + aLxL Karatsuba noted that one of the four multiplications can be removed at the expense of introducing a few additions: (2baH + aL)  (2bxH + xL) = 22baHxH + 2b [(aH + aL)  (xH + xL) – aHxH – aLxL] + aLxL b bits Mult 1 Mult 3 Mult 2 aH aL xH xL Benefit is quite significant for extremely wide operands (4/3)5 = 4.2 (4/3)10 = (4/3)20 = (4/3)50 = 1,765,781 Apr. 2012 Computer Arithmetic, Multiplication

274 12.2 Additive Multiply Modules
Fig Additive multiply module with 2 × 4 multiplier (ax) plus 4-bit and 2-bit additive inputs (y and z). A b × c AMM has b-bit and c-bit multiplicative inputs, b-bit and c-bit additive inputs, and a (b + c)-bit output: (2^b – 1) × (2^c – 1) + (2^b – 1) + (2^c – 1) = 2^(b+c) – 1 Apr. 2012 Computer Arithmetic, Multiplication

275 Multiplier Built of AMMs
Understanding an 8 × 8 multiplier built of 4 × 2 AMMs using dot notation Fig An 8 × 8 multiplier built of 4 × 2 AMMs. Inputs marked with an asterisk carry 0s. Apr. 2012 Computer Arithmetic, Multiplication

276 Multiplier Built of AMMs: Alternate Design
This design is more regular than that in Fig and is easily expandable to larger configurations; its latency, however, is greater Fig Alternate 8 × 8 multiplier design based on 4 × 2 AMMs. Inputs marked with an asterisk carry 0s. Apr. 2012 Computer Arithmetic, Multiplication

277 12.3 Bit-Serial Multipliers
Bit-serial adder (LSB first): a full adder plus a carry flip-flop; the xi, yi bits enter and the si bits leave one per clock Bit-serial multiplier: a and x enter serially, p leaves serially What goes inside the box to make a bit-serial multiplier? (Must follow the k-bit inputs with k 0s; alternatively, view the product as being only k bits wide) Can the circuit be designed to support a high clock rate? Apr. 2012 Computer Arithmetic, Multiplication

278 Semisystolic Serial-Parallel Multiplier
Fig Semi-systolic circuit for 4 × 4 multiplication in 8 clock cycles. This is called “semisystolic” because it has a large signal fan-out of k (k-way broadcasting) and a long wire spanning all k positions Apr. 2012 Computer Arithmetic, Multiplication

279 Systolic Retiming as a Design Tool
A semisystolic circuit can be converted to a systolic circuit via retiming, which involves advancing and retarding signals by means of delay removal and delay insertion in such a way that the relative timings of various parts are unaffected Fig Example of retiming by delaying the inputs to CL and advancing the outputs from CL by d units Apr. 2012 Computer Arithmetic, Multiplication

280 Alternate Explanation of Systolic Retiming
Transferring delay from the outputs of a subsystem to its inputs does not change the behavior of the overall system: with block delay a and extra delays d1 and d2, the output is ready at time t + a + d1 + d2 whether the d1, d2 delays follow the block or precede it Apr. 2012 Computer Arithmetic, Multiplication

281 A First Attempt at Retiming
Fig. 12.7 Fig A retimed version of our semi-systolic multiplier. Apr. 2012 Computer Arithmetic, Multiplication

282 Deriving a Fully Systolic Multiplier
Fig. 12.7 Fig Systolic circuit for 4 × 4 multiplication in 15 cycles. Apr. 2012 Computer Arithmetic, Multiplication

283 A Direct Design for a Bit-Serial Multiplier
Fig Building block for a latency-free bit-serial multiplier. Fig The cellular structure of the bit-serial multiplier based on the cell in Fig Fig Bit-serial multiplier design in dot notation. Apr. 2012 Computer Arithmetic, Multiplication

284 Computer Arithmetic, Multiplication
12.4 Modular Multipliers Fig Modulo-(2^b – 1) carry-save adder. Fig Design of a 4 × 4 modulo-15 multiplier. Apr. 2012 Computer Arithmetic, Multiplication

285 Other Examples of Modular Multiplication
Fig One way to design a 4 × 4 modulo-13 multiplier. Fig A method for modular multioperand addition. Apr. 2012 Computer Arithmetic, Multiplication

286 12.5 The Special Case of Squaring
Fig Design of a 5-bit squarer. Apr. 2012 Computer Arithmetic, Multiplication

287 Divide-and-Conquer Squarers
Building wide squarers from narrower ones: (2^b xH + xL)² = 2^2b xH² + 2^(b+1) xH xL + xL², so a wide squarer needs two narrow squarers and one multiplier Divide-and-conquer (recursive) strategy for synthesizing a 2b × 2b squarer from b × b squarers and a multiplier. Apr. 2012 Computer Arithmetic, Multiplication

288 12.6 Combined Multiply-Add Units
Multiply-add versus multiply-accumulate: multiply-accumulate units often have wider additive inputs Fig Dot-notation representations of various methods for performing a multiply-add operation in hardware: the 4 × 4 multiplication dot matrix is augmented with an additive input in standard binary form or in carry-save form, supplied either to the CSA tree or at its output (panels a-d) Apr. 2012 Computer Arithmetic, Multiplication

289 Part IV Division May 2012 Computer Arithmetic, Division
28. Reconfigurable Arithmetic Appendix: Past, Present, and Future May 2012 Computer Arithmetic, Division

290 About This Presentation
This presentation is intended to support the use of the textbook Computer Arithmetic: Algorithms and Hardware Designs (Oxford U. Press, 2nd ed., 2010, ISBN ). It is updated regularly by the author as part of his teaching of the graduate course ECE 252B, Computer Arithmetic, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Unauthorized uses are strictly prohibited. © Behrooz Parhami Edition Released Revised First Jan. 2000 Sep. 2001 Sep. 2003 Oct. 2005 May 2007 May 2008 May 2009 Second May 2010 Apr. 2011 May 2012 May 2012 Computer Arithmetic, Division

291 Computer Arithmetic, Division
IV Division Review division schemes and various speedup methods Hardest basic operation (fortunately, also the rarest) Division speedup methods: high-radix, array, . . . Combined multiplication / division hardware Digit-recurrence vs convergence division schemes Topics in This Part Chapter 13 Basic Division Schemes Chapter 14 High-Radix Dividers Chapter 15 Variations in Dividers Chapter 16 Division by Convergence May 2012 Computer Arithmetic, Division

292 Computer Arithmetic, Division
Be fruitful and multiply . . . Now, divide. May 2012 Computer Arithmetic, Division

293 13 Basic Division Schemes
Chapter Goals Study shift/subtract or bit-at-a-time dividers and set the stage for faster methods and variations to be covered in Chapters 14-16 Chapter Highlights Shift/subtract divide vs shift/add multiply Hardware, firmware, software algorithms Dividing 2’s-complement numbers The special case of a constant divisor May 2012 Computer Arithmetic, Division

294 Basic Division Schemes: Topics
Topics in This Chapter 13.1 Shift/Subtract Division Algorithms 13.2 Programmed Division 13.3 Restoring Hardware Dividers 13.4 Nonrestoring and Signed Division 13.5 Division by Constants 13.6 Radix-2 SRT Division May 2012 Computer Arithmetic, Division

295 13.1 Shift/Subtract Division Algorithms
Notation for our discussion of division algorithms: z Dividend z2k–1 z2k–2 ... z3 z2 z1 z0 d Divisor dk–1 dk–2 ... d1 d0 q Quotient qk–1 qk–2 ... q1 q0 s Remainder, z – (d × q) sk–1 sk–2 ... s1 s0 Initially, we assume unsigned operands Fig Division of an 8-bit number by a 4-bit number in dot notation. May 2012 Computer Arithmetic, Division

296 Division versus Multiplication
Division is more complex than multiplication: Need for quotient digit selection or estimation Overflow possibility: the high-order k bits of z must be strictly less than d; this overflow check also detects the divide-by-zero condition. Pentium III latencies Instruction Latency Cycles/Issue Load / Store Integer Multiply Integer Divide Double/Single FP Multiply Double/Single FP Add Double/Single FP Divide The ratios haven’t changed much in later Pentiums, Atom, or AMD products* *Source: T. Granlund, “Instruction Latencies and Throughput for AMD and Intel x86 Processors,” Feb. 2012 May 2012 Computer Arithmetic, Division

297 Computer Arithmetic, Division
Division Recurrence (Fig. 13.1) Division with left shifts: s(j) = 2s(j–1) – qk–j (2^k d) with s(0) = z and s(k) = 2^k s (shift, then subtract) (There is no corresponding right-shift algorithm) Integer division is characterized by z = d × q + s 2^–2k z = (2^–k d) × (2^–k q) + 2^–2k s zfrac = dfrac × qfrac + 2^–k sfrac Divide fractions like integers; adjust the remainder No-overflow condition for fractions is: zfrac < dfrac May 2012 Computer Arithmetic, Division
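An integer sketch of this recurrence (mine), producing one quotient bit per step:

```python
def divide(z, d, k):
    """Unsigned division by left shifts; requires z < d * 2**k and d > 0."""
    s, q, dk = z, 0, d << k
    for _ in range(k):
        s *= 2                          # shift left
        bit = 1 if s >= dk else 0       # quotient digit selection
        s -= bit * dk                   # subtract q_(k-j) * 2^k * d
        q = (q << 1) | bit
    return q, s >> k                    # s(k) = 2^k * remainder

print(divide(117, 10, 4))   # (11, 7): 117 = 10 * 11 + 7
```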

298 Examples of Basic Division
Example: z = 117 = (0111 0101)two divided by d = 10 = (1010)two gives q = 11 = (1011)two and s = 7 = (0111)two (check: 117 = 10 × 11 + 7); the fractional version divides zfrac = (.0111 0101)two by dfrac = (.1010)two with the same digit selections Fig Examples of sequential division with integer and fractional operands. May 2012 Computer Arithmetic, Division

299 Computer Arithmetic, Division
13.2 Programmed Division Fig Register usage for programmed division. May 2012 Computer Arithmetic, Division

300 Assembly Language Program for Division
{Using left shifts, divide unsigned 2k-bit dividend, z_high|z_low, storing the k-bit quotient and remainder. Registers: R0 holds 0, Rc for counter, Rd for divisor, Rs for z_high & remainder, Rq for z_low & quotient}
{Load operands into registers Rd, Rs, and Rq}
div: load Rd with divisor
     load Rs with z_high
     load Rq with z_low
{Check for exceptions}
     branch d_by_0 if Rd = R0
     branch d_ovfl if Rs ≥ Rd
{Initialize counter}
     load k into Rc
{Begin division loop}
d_loop: shift Rq left 1 {zero to LSB, MSB to carry}
     rotate Rs left 1 {carry to LSB, MSB to carry}
     skip if carry = 1
     branch no_sub if Rs < Rd
     sub Rd from Rs
     incr Rq {set quotient digit to 1}
no_sub: decr Rc {decrement counter by 1}
     branch d_loop if Rc ≠ 0
{Store the quotient and remainder}
     store Rq into quotient
     store Rs into remainder
d_by_0: ...
d_ovfl: ...
d_done: ...
Fig. Programmed division using left shifts. May 2012 Computer Arithmetic, Division

301 Time Complexity of Programmed Division
Assume k-bit words k iterations of the main loop 6-8 instructions per iteration, depending on the quotient bit Thus, 6k + 3 to 8k + 3 machine instructions, ignoring operand loads and result store k = 32 implies 220+ instructions on average This is too slow for many modern applications! Microprogrammed division would be somewhat better May 2012 Computer Arithmetic, Division

302 13.3 Restoring Hardware Dividers
In 2's-complement arithmetic, adding a negative value (here, –2^k d) to a positive value produces cout = 1 if the result is positive; the carry-out thus tells us whether the trial difference should be kept
Fig. Shift/subtract sequential restoring divider. May 2012 Computer Arithmetic, Division

303 Example of Restoring Unsigned Division
Example of Restoring Unsigned Division (z = 117, d = 10)
z = 0111 0101, 2^4 d = 0 1010 0000, –2^4 d = 1 0110 0000 (2's complement)
s(0) = 0 0111 0101
2s(0) = 0 1110 1010; +(–2^4 d) gives 0 0100 1010: positive, so set q3 = 1; s(1) = 0 0100 1010
2s(1) = 0 1001 0100; +(–2^4 d) gives a negative result: set q2 = 0 and restore, s(2) = 2s(1)
2s(2) = 1 0010 1000; +(–2^4 d) gives 0 1000 1000: positive, so set q1 = 1; s(3) = 0 1000 1000
2s(3) = 1 0001 0000; +(–2^4 d) gives 0 0111 0000: positive, so set q0 = 1; s(4) = 0 0111 0000
s = 0111, q = 1011
No overflow, because (0111)two < (1010)two
Fig. Example of restoring unsigned division. May 2012 Computer Arithmetic, Division

304 Indirect Signed Division
In division with signed operands, q and s are defined by z = d × q + s, sign(s) = sign(z), |s| < |d|
Examples of division with signed operands:
z = 5, d = 3 ⇒ q = 1, s = 2
z = 5, d = –3 ⇒ q = –1, s = 2
z = –5, d = 3 ⇒ q = –1, s = –2 (not q = –2, s = 1)
z = –5, d = –3 ⇒ q = 1, s = –2
Magnitudes of q and s are unaffected by input signs; signs of q and s are derivable from signs of z and d
Will discuss direct signed division later
May 2012 Computer Arithmetic, Division

305 13.4 Nonrestoring and Signed Division
The cycle time in restoring division must accommodate: Shifting the registers Allowing signals to propagate through the adder Determining and storing the next quotient digit Storing the trial difference, if required Later events depend on earlier ones in the same cycle, causing a lengthening of the clock cycle Nonrestoring division to the rescue! Assume qk–j = 1 and subtract Store the result as the new PR (the partial remainder can become incorrect, hence the name “nonrestoring”) May 2012 Computer Arithmetic, Division

306 Justification for Nonrestoring Division
Why is it acceptable to store an incorrect value in the partial-remainder register?
Shifted partial remainder at start of the cycle is u
Suppose subtraction yields the negative result u – 2^k d
Option 1: Restore the partial remainder to its correct value u, shift left, and subtract to get 2u – 2^k d
Option 2: Keep the incorrect partial remainder u – 2^k d, shift left, and add to get 2(u – 2^k d) + 2^k d = 2u – 2^k d
Either way, the next partial remainder is the same; option 2 simply avoids the restoration step
May 2012 Computer Arithmetic, Division
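A corresponding Python sketch of nonrestoring division (again my own illustration, not the book's code): the partial remainder may go negative, the following step adds instead of subtracting, and one corrective addition at the end fixes a negative final remainder:

def nonrestoring_divide(z, d, k):
    dd = d << k                        # 2^k d
    s, q = z, 0
    for _ in range(k):
        q <<= 1
        s = (s << 1) - dd if s >= 0 else (s << 1) + dd
        if s >= 0:
            q += 1                     # digit 1; a 0 digit records the overshoot
    if s < 0:                          # final correction step
        s += dd                        # q needs no change: its last digit is already 0
    return q, s >> k

print(nonrestoring_divide(117, 10, 4))   # (11, 7), matching the restoring example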

307 Example of Nonrestoring Unsigned Division
Example of Nonrestoring Unsigned Division (z = 117, d = 10)
z = 0111 0101, 2^4 d = 1010 0000 (values below shown in decimal for clarity)
s(0) = 117
2s(0) = 234: positive, so subtract 160; s(1) = 74
2s(1) = 148: s(1) positive, so set q3 = 1 and subtract; s(2) = –12
2s(2) = –24: s(2) negative, so set q2 = 0 and add; s(3) = 136
2s(3) = 272: s(3) positive, so set q1 = 1 and subtract; s(4) = 112
s(4) positive, so set q0 = 1
s = 112 / 2^4 = 7, q = (1011)two = 11
No overflow: (0111)two < (1010)two
Fig. Example of nonrestoring unsigned division. May 2012 Computer Arithmetic, Division

308 Graphical Depiction of Nonrestoring Division
Example: (0111 0101)two / (1010)two, that is, (117)ten / (10)ten
Fig. Partial remainder variations for restoring and nonrestoring division. May 2012 Computer Arithmetic, Division

309 Convergence of the Partial Quotient to q
Example: (0111 0101)two / (1010)two, that is, (117)ten / (10)ten = (11)ten = (1011)two
In restoring division, the partial quotient converges to q from below
In nonrestoring division, the partial quotient may overshoot q, but converges to it after some oscillations
[Figure: partial quotient q(0) through q(4) versus iteration, for restoring and nonrestoring division]
May 2012 Computer Arithmetic, Division

310 Nonrestoring Division with Signed Operands
In restoring division: qk–j = 0 means no subtraction (or subtraction of 0); qk–j = 1 means subtraction of d
In nonrestoring division, we always subtract or add: it is as if quotient digits are selected from the set {1, –1}, where 1 corresponds to subtraction and –1 corresponds to addition
Our goal is to end up with a remainder that matches the sign of the dividend
This idea of trying to match the sign of s with the sign of z leads to a direct signed division algorithm: if sign(s) = sign(d) then qk–j = 1 else qk–j = –1
Example: q =
May 2012 Computer Arithmetic, Division

311 Quotient Conversion and Final Correction
-d +d 2 z Partial remainder variation and selected quotient digits during nonrestoring division with d > 0 Quotient with digits -1 and 1 Replace -1s with 0s Shift left, complement MSB, and set LSB to 1 to get the 2’s-complement quotient Check: – 8 – = -25 = Final correction step if sign(s)  sign(z): Add d to, or subtract d from, s; subtract 1 from, or add 1 to, q May 2012 Computer Arithmetic, Division

312 Example of Nonrestoring Signed Division
Example of Nonrestoring Signed Division (z = 33, d = –7)
z = 0010 0001, 2^4 d = 1001 0000 (= –112), –2^4 d = 0111 0000 (values below in decimal)
s(0) = 33
2s(0) = 66: sign(s(0)) ≠ sign(d), so set q3 = –1 and add 2^4 d; s(1) = –46
2s(1) = –92: sign(s(1)) = sign(d), so set q2 = 1 and subtract; s(2) = 20
2s(2) = 40: sign(s(2)) ≠ sign(d), so set q1 = –1 and add; s(3) = –72
2s(3) = –144: sign(s(3)) = sign(d), so set q0 = 1 and subtract; s(4) = –32
sign(s(4)) ≠ sign(z), so perform corrective subtraction of d: s(4) = 80, that is, s = 5
q = (–1 1 –1 1)BSD: replace –1s with 0s (0101), shift left and set LSB (0 1011), complement MSB: (1 1011)2's-compl = –5; add 1 to correct: q = –4
Check: 33/(–7) = –4, remainder 5
Fig. Example of nonrestoring signed division. May 2012 Computer Arithmetic, Division

313 Nonrestoring Hardware Divider
Fig Shift-subtract sequential nonrestoring divider. May 2012 Computer Arithmetic, Division

314 Computer Arithmetic, Division
13.5 Division by Constants
Software and hardware aspects: As was the case for multiplication by constants, optimizing compilers may replace some divisions by shifts/adds/subtracts; likewise, in custom VLSI circuits, hardware dividers may be replaced by simpler adders
Method 1: Find the reciprocal of the constant and multiply (particularly efficient if several numbers must be divided by the same divisor)
Method 2: Use the property that for each odd integer d, there exists an odd integer m such that d × m = 2^n – 1; hence, d = (2^n – 1)/m and
z/d = zm/(2^n – 1) = zm × 2^–n (1 + 2^–n)(1 + 2^–2n)(1 + 2^–4n) . . .
One multiplication by the constant m, then shift-adds; the number of shift-adds required is proportional to log k
May 2012 Computer Arithmetic, Division

315 Example Division by a Constant
Example: Dividing the number z by 5, assuming 24 bits of precision. We have d = 5, m = 3, n = 4; 5 × 3 = 2^4 – 1
Instruction sequence for division by 5:
q ← z + z shift-left 1 {3z computed}
q ← q + q shift-right 4 {3z(1 + 2^–4) computed}
q ← q + q shift-right 8 {3z(1 + 2^–4)(1 + 2^–8) computed}
q ← q + q shift-right 16 {3z(1 + 2^–4)(1 + 2^–8)(1 + 2^–16) computed}
q ← q shift-right 4 {3z(1 + 2^–4)(1 + 2^–8)(1 + 2^–16)/16 computed}
5 shifts, 4 adds
May 2012 Computer Arithmetic, Division

316 Numerical Examples for Division by 5
Numerical Examples for Division by 5, using the instruction sequence of the preceding slide
Computing 29 ÷ 5 (z = 29, d = 5):
87 ← 29 + 58 {3z computed}
92 ← 87 + 5 {3z(1 + 2^–4) computed; 87 shift-right 4 = 5}
92 ← 92 + 0 {3z(1 + 2^–4)(1 + 2^–8) computed; 92 shift-right 8 = 0}
92 ← 92 + 0 {3z(1 + 2^–4)(1 + 2^–8)(1 + 2^–16) computed}
5 ← 92 shift-right 4 {3z(1 + 2^–4)(1 + 2^–8)(1 + 2^–16)/16 computed}
Repeat the process for computing 30 ÷ 5 and comment on the outcome
May 2012 Computer Arithmetic, Division
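Running the same shift-add steps in Python (a sketch, not from the book) shows the outcome for both dividends; note what happens for z = 30:

def div5(z):
    q = z + (z << 1)        # 3z
    q += q >> 4             # 3z(1 + 2^-4)
    q += q >> 8             # 3z(1 + 2^-4)(1 + 2^-8)
    q += q >> 16            # 3z(1 + 2^-4)(1 + 2^-8)(1 + 2^-16)
    return q >> 4

print(div5(29), div5(30))   # 5 5: the result for 30 falls short of 6, because the
                            # truncated factors approach 3/5 from below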

317 Computer Arithmetic, Division
13.6 Radix-2 SRT Division SRT division takes its name from Sweeney, Robertson, and Tocher, who independently discovered the method s(j) = 2s(j–1) – q–j d with s(0) = z s(k) = 2ks q–j  {-1, 1} Fig The new partial remainder, s(j), as a function of the shifted old partial remainder, 2s(j–1), in radix-2 nonrestoring division. May 2012 Computer Arithmetic, Division

318 Allowing 0 as a Quotient Digit in Nonrestoring Division
This method was useful in early computers, because the choice q–j = 0 requires shifting only, which was faster than shift-and-subtract s(j) = 2s(j–1) – q–j d with s(0) = z s(k) = 2ks q–j  {-1, 0, 1} Fig The new partial remainder, s(j), as a function of the shifted old partial remainder, 2s(j–1), with q–j in {-1, 0, 1}. May 2012 Computer Arithmetic, Division

319 The Radix-2 SRT Division Algorithm
We use the comparison constants -½ and ½ for quotient digit selection 2s  +½ means 2s = (0.1xxxxxxxx)2’s-compl 2s < -½ means 2s = (1.0xxxxxxxx)2’s-compl s(j) = 2s(j–1) – q–j d with s(0) = z s(k) = 2ks s(j)  [-½, ½ ) q–j  {-1, 0, 1} Fig The relationship between new and old partial remainders in radix-2 SRT division. May 2012 Computer Arithmetic, Division
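A float-based Python sketch of the radix-2 SRT recurrence (my illustration; a real divider examines the leading bit patterns above rather than comparing floats). Operands are fractional, with d in [1/2, 1) and |z| < 1/2 so that s stays in [–1/2, 1/2):

def srt2_divide(z, d, k):
    s, q = z, 0.0
    for j in range(1, k + 1):
        s *= 2                      # shifted partial remainder 2s(j-1)
        if s >= 0.5:   qd = 1       # 2s = (0.1...): choose 1
        elif s < -0.5: qd = -1      # 2s = (1.0...): choose -1
        else:          qd = 0       # 2s = (0.0...) or (1.1...): choose 0
        s -= qd * d                 # s(j) = 2s(j-1) - q_-j d
        q += qd * 2.0 ** -j
    if s < 0:                       # final correction step
        s += d
        q -= 2.0 ** -k
    return q, s * 2.0 ** -k         # quotient and true (scaled-back) remainder

q, s = srt2_divide(0.26953125, 0.625, 4)   # z = (.0100 0101)two, d = (.1010)two
print(q, s)                                 # 0.375 = (.0110)two, remainder 0.03515625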

320 Radix-2 SRT Division with Variable Shifts
We use the comparison constants –½ and ½ for quotient digit selection:
For 2s ≥ +½, that is, 2s = (0.1xxxxxxxx)2's-compl, choose q–j = 1
For 2s < –½, that is, 2s = (1.0xxxxxxxx)2's-compl, choose q–j = –1
Choose q–j = 0 in other cases: 0 ≤ 2s < +½, that is, 2s = (0.0xxxxxxxx)2's-compl, and –½ ≤ 2s < 0, that is, 2s = (1.1xxxxxxxx)2's-compl
Observation: What happens when the magnitude of 2s is fairly small? For example, with 2s = (0.0000 1xxx)2's-compl, choosing q–j = 0 would lead to the same condition in the next step, so 5 quotient digits can be generated at once; similarly, 2s = (1.1110 xxxxx)2's-compl lets us generate 4 quotient digits
Use a leading-0s or leading-1s detection circuit to determine how many quotient digits can be spewed out at once
Statistically, the average skipping distance will be 2.67 bits
May 2012 Computer Arithmetic, Division

321 Example Unsigned Radix-2 SRT Division
Example Unsigned Radix-2 SRT Division (z = 0.0100 0101, d = 0.1010)
–d = 1.0110 (2's complement)
s(0) = 0.0100 0101
2s(0) = 0.1000 1010 ≥ ½, so set q–1 = 1 and subtract: s(1) = 1.1110 1010
2s(1) = 1.1101 0100 in [–½, ½), so set q–2 = 0; s(2) = 2s(1)
2s(2) = 1.1010 1000 in [–½, ½), so set q–3 = 0; s(3) = 2s(2)
2s(3) = 1.0101 0000 < –½, so set q–4 = –1 and add: s(4) = 1.1111 0000
s(4) negative, so add d to correct: s(4) = 0.1001 0000 (in [–½, ½), so okay)
Uncorrected BSD quotient q = 0.1 0 0 –1; convert and subtract ulp: q = 0.0110
s = 0.1001 × 2^–4, q = 0.0110
Digit selection summary: 0.1… choose 1; 1.0… choose –1; 0.0…/1.1… choose 0
Fig. Example of unsigned radix-2 SRT division. May 2012 Computer Arithmetic, Division

322 Preview of Fast Dividers
Multiplication and division as multioperand addition problems. Like multiplication, division is multioperand addition Thus, there are but two ways to speed it up: a. Reducing the number of operands (divide in a higher radix) b. Adding them faster (keep partial remainder in carry-save form) There is one complication that makes division inherently more difficult: The terms to be subtracted from (added to) the dividend are not known a priori but become known as quotient digits are computed; quotient digits in turn depend on partial remainders May 2012 Computer Arithmetic, Division

323 Computer Arithmetic, Division
14 High-Radix Dividers Chapter Goals Study techniques that allow us to obtain more than one quotient bit in each cycle (two bits in radix 4, three in radix 8, . . .) Chapter Highlights Radix > 2  quotient digit selection harder Remedy: redundant quotient representation Carry-save addition reduces cycle time Quotient digit selection Implementation methods and tradeoffs May 2012 Computer Arithmetic, Division

324 High-Radix Dividers: Topics
Topics in This Chapter 14.1 Basics of High-Radix Division 14.2 Using Carry-Save Adders 14.3 Radix-4 SRT Division 14.4 General High-Radix Dividers 14.5 Quotient Digit Selection 14.6 Using p-d Plots in Practice May 2012 Computer Arithmetic, Division

325 14.1 Basics of High-Radix Division
14.1 Basics of High-Radix Division
Radices of practical interest are powers of 2, and perhaps 10
Division with left shifts: s(j) = r s(j–1) – qk–j (r^k d), with s(0) = z and s(k) = r^k s (shift, then subtract)
Fig. 14.1 Radix-4 division in dot notation
May 2012 Computer Arithmetic, Division

326 Difficulty of Quotient Digit Selection
What is the first quotient digit in the following radix-10 division, 12257 / 2043?
Successive prefix estimates: 12 / 2 = 6; 122 / 20 = 6; 1225 / 204 = 6; 12257 / 2043 = 5
The problem with the pencil-and-paper division algorithm is that there is no room for error in choosing the next quotient digit
In the worst case, all k digits of the divisor and k + 1 digits of the partial remainder are needed to make a correct choice
Suppose we use the redundant signed digit set [–9, 9] in radix 10: then, we could choose 6 as the next quotient digit, knowing that we can recover from an incorrect choice by using negative digits: (5 9)ten = (6 –1)ten
May 2012 Computer Arithmetic, Division

327 Examples of High-Radix Division
Examples of High-Radix Division
Radix-4 integer division: s(j) = 4s(j–1) – qk–j (4^4 d), producing quotient digits q3 = 1, q2 = 0, q1 = 1, q0 = 2
Radix-10 fractional division: s(j) = 10 s(j–1) – q–j d, producing quotient digits q–1 = 7, q–2 = 0, with qfrac and remainder sfrac read off at the end
Fig. Examples of high-radix division with integer and fractional operands. May 2012 Computer Arithmetic, Division

328 14.2 Using Carry-Save Adders
Fig Constant thresholds used for quotient digit selection in radix-2 division with qk–j in {–1, 0, 1} . May 2012 Computer Arithmetic, Division

329 Quotient Digit Selection Based on Truncated PR
Quotient Digit Selection Based on Truncated PR
Sum part of 2s(j–1): u = (u1u0 . u–1u–2 . . .)2's-compl
Carry part of 2s(j–1): v = (v1v0 . v–1v–2 . . .)2's-compl
Approximation to the partial remainder: t = u[–2,1] + v[–2,1] {add the 4 MSBs of u and v}
Selection rule: if t < –½ then q–j = –1 else if t ≥ 0 then q–j = 1 else q–j = 0 endif
Max error in approximation < ¼ + ¼ = ½; error in [0, ½)
May 2012 Computer Arithmetic, Division
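In Python, the selection rule reads as follows (a sketch; truncating a 2's-complement value to 2 fractional bits corresponds to flooring to a multiple of 1/4):

import math

def select_quotient_digit(u, v):
    t = math.floor(u * 4) / 4 + math.floor(v * 4) / 4   # u[-2,1] + v[-2,1]
    if t < -0.5:
        return -1
    return 1 if t >= 0 else 0

# Each truncation discards less than 1/4, so the approximation error is in [0, 1/2)
print(select_quotient_digit(0.30, -0.20))   # t = 0.25 + (-0.25) = 0.0, so digit 1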

330 Divider with Partial Remainder in Carry-Save Form
Fig Block diagram of a radix-2 divider with partial remainder in stored-carry form. May 2012 Computer Arithmetic, Division

331 Why We Cannot Use Carry-Save PR with SRT Division
Fig Overlap regions in radix-2 SRT division. May 2012 Computer Arithmetic, Division

332 14.4 Choosing the Quotient Digits
Fig. 14.3 A p-d plot for radix-2 division with d ∈ [1/2, 1), partial remainder in [–d, d), and quotient digits in [–1, 1]. May 2012 Computer Arithmetic, Division

333 Design of the Quotient Digit Selection Logic
Shifted sum = (u1u0 . u–1u–2)2's-compl; shifted carry = (v1v0 . v–1v–2)2's-compl
A 4-bit adder produces the approximate shifted PR: t = (t1t0 . t–1t–2)2's-compl
Selection rule: t := u[–2,1] + v[–2,1]; if t < –½ then q–j = –1 else if t ≥ 0 then q–j = 1 else q–j = 0 endif
Combinational logic: Non0 = t1′ ∨ t0′ ∨ t–1′ = (t1 t0 t–1)′; Sign = t1 (t0′ ∨ t–1′) = t1 (t0 t–1)′
May 2012 Computer Arithmetic, Division

334 Computer Arithmetic, Division
14.3 Radix-4 SRT Division
Radix-4 fractional division with left shifts and q–j ∈ [–3, 3]: s(j) = 4s(j–1) – q–j d, with s(0) = z and s(k) = 4^k s (shift, then subtract)
Fig. New versus shifted old partial remainder in radix-4 division with q–j in [–3, 3].
Two difficulties: How do you choose from among the 7 possible values for q–j? If the choice is +3 or –3, how do you form 3d?
May 2012 Computer Arithmetic, Division

335 Building the p-d Plot for Radix-4 Division
Building the p-d Plot for Radix-4 Division
Fig. A p-d plot for radix-4 SRT division with quotient digit set [–3, 3]; note the uncertainty regions. May 2012 Computer Arithmetic, Division

336 Restricting the Quotient Digit Set in Radix 4
Radix-4 fractional division with left shifts and q–j ∈ [–2, 2]: s(j) = 4s(j–1) – q–j d, with s(0) = z and s(k) = 4^k s (shift, then subtract)
Fig. New versus shifted old partial remainder in radix-4 division with q–j in [–2, 2].
For this restriction to be feasible, we must have s ∈ [–hd, hd) for some h < 1, and 4hd – 2d ≤ hd
This yields h ≤ 2/3 (choose h = 2/3 to minimize the restriction)
May 2012 Computer Arithmetic, Division

337 Building the p-d Plot with Restricted Radix-4 Digit Set
Fig A p-d plot for radix-4 SRT division with quotient digit set [–2, 2]. May 2012 Computer Arithmetic, Division

338 14.4 General High-Radix Dividers
Process to derive the details: radix r; digit set [–a, a] for q–j; number of bits of p (v and u) and d to be inspected; quotient digit selection unit (table or logic); multiple generation/selection scheme; conversion of the redundant q to 2's complement
Fig. Block diagram of radix-r divider with partial remainder in stored-carry form. May 2012 Computer Arithmetic, Division

339 Multiple Generation for High-Radix Division
Example: Digit set [–6, 6] for r = 8
Option 1: precompute 3a and 5a
Option 2: generate a multiple |q–j| a as a set of two numbers, one chosen from {0, a, 2a} and another from {0, a, 4a}
May 2012 Computer Arithmetic, Division

340 14.5 Quotient Digit Selection
Radix-r division with quotient digit set [–a, a], a < r – 1
Restrict the partial remainder range, say to [–hd, hd)
From the solid rectangle in Fig. 15.1, we get rhd – ad ≤ hd, or h ≤ a/(r – 1)
To minimize the range restriction, we choose h = a/(r – 1)
Example: r = 4, a = 2 ⇒ h = 2/3
Fig. The relationship between new and shifted old partial remainders in radix-r division with quotient digits in [–a, +a]. May 2012 Computer Arithmetic, Division

341 Why Using Truncated p and d Values Is Acceptable
Standard p xx.xxxx Carry-save p xx.xxxxx Fig A part of p-d plot showing the overlap region for choosing the quotient digit value b or b+1 in radix-r division with quotient digit set [–a, a]. May 2012 Computer Arithmetic, Division

342 Table Entries in the Quotient Digit Selection Logic
Fig A part of p-d plot showing an overlap region and its staircase-like selection boundary. May 2012 Computer Arithmetic, Division

343 14.6 Using p-d Plots in Practice
The smallest Δd occurs for the overlap region of digit values a and a – 1
Fig. Establishing upper bounds on the dimensions of uncertainty rectangles. May 2012 Computer Arithmetic, Division

344 Example: Lower Bounds on Precision
For r = 4, divisor range [0.5, 1), digit set [–2, 2], we have a = 2, dmin = 1/2, h = a/(r – 1) = 2/3
Because 1/8 = 2^–3 and 2^–3 ≤ 1/6 < 2^–2, we must inspect at least 3 bits of d (2, given its leading 1) and 3 bits of p
These are lower bounds and may prove inadequate; in fact, 3 bits of p and 4 (3) bits of d are required
With p in carry-save form, 4 bits of each component must be inspected
May 2012 Computer Arithmetic, Division

345 Upper Bounds for Precision
Theorem: Once lower bounds on precision are determined based on d and p, one more bit of precision in each direction is always adequate Proof: Let w be the spacing of vertical grid lines w  d/  v  p/  u  p/2 May 2012 Computer Arithmetic, Division

346 Some Implementation Details
Fig Example of p-d plot allowing larger uncertainty rectangles, if the 4 cases marked with asterisks are handled as exceptions. Fig The asymmetry of quotient digit selection process. May 2012 Computer Arithmetic, Division

347 Computer Arithmetic, Division
A Complete p-d Plot Radix r = 4 q–j in [–2, 2] d in [1/2, 1) p in [–8/3, 8/3] Explanation of the Pentium division bug May 2012 Computer Arithmetic, Division

348 15 Variations in Dividers
Chapter Goals Discuss some variations in implementing division schemes and cover combinational, modular, and merged hardware dividers Chapter Highlights Prescaling simplifies q digit selection Overlapped q digit selection Parallel hardware (array) dividers Shared hardware in multipliers/dividers Square-rooting not special case of division May 2012 Computer Arithmetic, Division

349 Variations in Dividers: Topics
Topics in This Chapter 15.1 Division with Prescaling 15.2 Overlapped Quotient Digit Selection 15.3 Combinational and Array Dividers 15.4 Modular Dividers and Reducers 15.5 The Special Case of Reciprocation 15.6 Combined Multiply/Divide Units May 2012 Computer Arithmetic, Division

350 15.1 Division with Prescaling
Overlap regions of a p-d plot are wider toward the high end of the divisor range
If we can restrict the magnitude of the divisor to an interval close to dmax (say 1 – ε < d < 1 + δ, when dmax = 1), quotient digit selection may become simpler
Thus, we perform the division (zm)/(dm) for a suitably chosen scale factor m (m > 1)
Prescaling (multiplying z and d by m) should be done without real multiplications
Restricting the divisor to the shaded area simplifies quotient digit selection. May 2012 Computer Arithmetic, Division

351 Examples of Prescaling
Example 1: Unsigned divisor d in [1/2, 1). When d ∈ [1/2, 3/4), multiply by 1½ [d begins 0.10…]; the prescaled divisor will be in [1 – 1/4, 1 + 1/8)
Example 2: Unsigned divisor d in [1/2, 1)
Case d ∈ [1/2, 9/16): begins with 0.1000…, multiply by 2
Case d ∈ [9/16, 5/8): begins with 0.1001…, multiply by 1 + 1/2
Case d ∈ [5/8, 3/4): begins with 0.101…, multiply by 1 + 1/2
Case d ∈ [3/4, 1): begins with 0.11…, multiply by 1 + 1/8
[1/2, 9/16) × 2 = [1, 1 + 1/8)
[9/16, 5/8) × (1 + 1/2) = [1 – 5/32, 1 – 1/16)
[5/8, 3/4) × (1 + 1/2) = [1 – 1/16, 1 + 1/8)
[3/4, 1) × (1 + 1/8) = [1 – 5/32, 1 + 1/8)
The prescaled divisor will be in [1 – 5/32, 1 + 1/8)
May 2012 Computer Arithmetic, Division
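A quick numeric check of Example 2 (my sketch; the case analysis simply mirrors the leading bits of d):

def prescale(d):                      # d in [1/2, 1)
    if   d < 9/16: m = 2              # d = (0.1000...)two
    elif d < 3/4:  m = 1 + 1/2        # d = (0.1001...)two or (0.101...)two
    else:          m = 1 + 1/8        # d = (0.11...)two
    return d * m

for d in (0.5, 0.56, 0.6, 0.7, 0.75, 0.99):
    assert 1 - 5/32 <= prescale(d) < 1 + 1/8   # prescaled divisor in [1 - 5/32, 1 + 1/8)
print("all cases in range")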

352 15.2 Overlapped Quotient Digit Selection
An alternative to high-radix design when quotient digit selection is too complex: compute the next partial remainder and the resulting quotient digit for all possible choices of the current quotient digit
This is the same idea as carry-select addition; speculative computation (throwing transistors at the delay problem) is common in modern systems
Fig. Overlapped radix-2 quotient digit selection for radix-4 division. A dashed line represents a signal pair that denotes a quotient digit value in [–1, 1]. May 2012 Computer Arithmetic, Division

353 15.3 Combinational and Array Dividers
Can take the notion of overlapped q digit selection to the extreme of selecting all q digits at once  Exponential complexity By contrast, a fully combinational tree multiplier has O(log k) latency and O(k2) cost O(k log k) conjectured Can we do as well as multipliers, or at least better than exponential cost, for logarithmic-time dividers? Complexity theory results: It is possible to design dividers with O(log k) latency and O(k4) cost with O(log k log log k) latency and O(k2) cost These theoretical constructions have not led to practical designs May 2012 Computer Arithmetic, Division

354 Restoring Array Divider
Fig Restoring array divider composed of controlled subtractor cells. May 2012 Computer Arithmetic, Division

355 Nonrestoring Array Divider
Fig Nonrestoring array divider built of controlled add/subtract cells. Similarity to array multiplier is deceiving Critical path May 2012 Computer Arithmetic, Division

356 Speedup Methods for Array Dividers
Idea: Pass the partial remainder downward in carry-save form to speed up the operation of each row Critical path Fig. 15.8 However, we still need to know the carry/borrow-out from each row Solution: Insert a carry-lookahead circuit between successive rows Not very cost-effective; thus not used in practice May 2012 Computer Arithmetic, Division

357 15.4 Modular Dividers and Reducers
Given dividend z and divisor d, with d ≠ 0, a modular divider computes q = ⌊z / d⌋ and s = z mod d = ⟨z⟩d
The quotient q is, by definition, an integer, but the inputs z and d do not have to be integers; the modular remainder is always positive
Example: ⌊–3.76 / 1.23⌋ = –4 and ⟨–3.76⟩1.23 = 1.16
The quotient and remainder of ordinary division are –3 and –0.07
A modular reducer computes only the modular remainder and is in many cases simpler than a full-blown divider
May 2012 Computer Arithmetic, Division

358 Montgomery Modular Reduction
Very efficient for reducing large numbers (100s of bits wide); the radix-2 version below is suitable for low-cost hardware realization; software versions are based on radix 2^32 or 2^64 (1 word = 1 digit)
Problem: Compute q = ax mod m, where m < 2^k
Straightforward solution: Compute ax as usual; then reduce mod m
Incremental reduction after adding each partial product is more efficient
Assume a, x, q, and other values are k-bit pseudoresidues (can be > m)
Pick R = 2^k with gcd(R, m) = 1, that is, m odd
Montgomery multiplication computes axR^–1 mod m, instead of ax mod m
Represent any number y as yR mod m (known as the M-code for y)
gcd(R, m) = 1 ensures that numbers in [0, m – 1] have distinct M-codes
Multiplication: t = (aR)(xR)R^–1 mod m = (ax)R mod m = M-code for ax
Initial conversion: Find yR by applying Montgomery's method to y and R^2
Final reconversion: Find y from t = yR by M-multiplying 1 and t
May 2012 Computer Arithmetic, Division
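The radix-2 method is easy to express in Python (an illustrative sketch; the small operand values a = 10, x = 11 are my own, with m = 13 and R = 16 as in the example that follows):

def mont_mul(a, x, m, k):           # computes a * x * R^-1 mod m, with R = 2^k, m odd
    p = 0
    for i in range(k):
        p += a * ((x >> i) & 1)     # add partial product x_i * a
        if p & 1:                   # decision based on the LSB only
            p += m                  # make p even by adding m (doesn't change p mod m)
        p >>= 1                     # exact halving
    return p                        # a pseudoresidue: may exceed m by up to m

a, x, m, k = 10, 11, 13, 4
t = mont_mul(a, x, m, k)
print(t % m, (a * x * 9) % m)       # both 2, since R^-1 = 9 mod 13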

359 Example Montgomery Modular Multiplication
Example Montgomery Modular Multiplication
Example parameters: r = 2; m = 13; R = 16 = r^4; R^–1 = 9 mod 13 (because 16 × 9 = 1 mod 13)
Fig. 15.4 (a) Ordinary binary multiplication: shifted partial products xi a are accumulated into p
(b) Mod-13 Montgomery multiplication: after adding each partial product, if the running value is odd, m is added to make it even; the value is then halved, so four such steps yield axR^–1 mod 13
May 2012 Computer Arithmetic, Division

360 Advantages of Montgomery’s Method
Standard reduction is based on subtracting a multiple of m from the result depending on the most significant bit(s) However, MSBs are not readily known if we use carry-save numbers In Montgomery reduction, the decision is based on LSB(s), thus allowing the use of carry-save arithmetic as well as parallel processing May 2012 Computer Arithmetic, Division

361 15.5 The Special Case of Reciprocation
15.5 The Special Case of Reciprocation
Fig. Square-rooting is not a special case of division, but reciprocation is: (a) squaring, via a multiplier p = ax with both inputs tied to y, giving y^2; (b) square-rooting cannot be obtained from a divider in this way; (c) reciprocation, via a divider q = z/d with z = 1, giving 1/y
Key question: Is reciprocation any faster than division?
Answer: Not if a conventional digit recurrence algorithm is used
May 2012 Computer Arithmetic, Division

362 Doubling the Speed of Reciprocation
Q  1/d with error  2–k/2 t = Q(2 – Qd)  1/d; error  2–k s(j+1) = 2s(j) – q–j d, with 2s(0) = 1 t(j+1) = 4t(j) + q–j (4s(j) – q–j d), with t(0) = 0 A: Digit-recurrence reciprocation to obtain Q  1/d Time saved d B: Digit-recurrence refinement to obtain q = Q(2 – Qd) q q–j Iterations for box A Iterations for box B Iterations for simple digit-recurrence reciprocation s(j) Fig Hybrid evaluation of the reciprocal 1/d by an approximate reciprocation stage and a refinement stage that operate concurrently. May 2012 Computer Arithmetic, Division

363 15.6 Combined Multiply/Divide Units
Similarity of blocks in multipliers and dividers (only shift direction is different) Fig. 9.4 Fig May 2012 Computer Arithmetic, Division

364 Single Unit for Sequential Multiplication and Division
The control unit proceeds through necessary steps for multiplication or division (including using the appropriate shift direction) The slight speed penalty owing to a more complex control unit is insignificant Fig Sequential radix-2 multiply/divide unit. May 2012 Computer Arithmetic, Division

365 Similarities of Array Multipliers and Array Dividers
Fig. 11.4 Fig. 15.8 May 2012 Computer Arithmetic, Division

366 Single Unit for Array Multiplication and Division
Each cell within the array can act as a modified adder or modified subtractor based on control input values In some designs, squaring and square-rooting functions are also included within the same array Fig I/O specification of a universal circuit that can act as an array multiplier or array divider. May 2012 Computer Arithmetic, Division

367 16 Division by Convergence
Chapter Goals Show how by using multiplication as the basic operation in each division step, the number of iterations can be reduced Chapter Highlights Digit-recurrence as convergence method Convergence by Newton-Raphson iteration Computing the reciprocal of a number Hardware implementation and fine tuning May 2012 Computer Arithmetic, Division

368 Division by Convergence: Topics
Topics in This Chapter 16.1 General Convergence Methods 16.2 Division by Repeated Multiplications 16.3 Division by Reciprocation 16.4 Speedup of Convergence Division 16.5 Hardware Implementation 16.6 Analysis of Lookup Table Size May 2012 Computer Arithmetic, Division

369 16.1 General Convergence Methods
Sequential digit-at-a-time (binary or high-radix) division can be viewed as a convergence scheme
As each new digit of q = z / d is determined, the quotient value is refined, until it reaches the final correct value
Convergence is from below in restoring division and oscillating in nonrestoring division
Meanwhile, the remainder s = z – q × d approaches 0; the scaled remainder is kept in a certain range, such as [–d, d)
May 2012 Computer Arithmetic, Division

370 Elaboration on Scaled Remainder in Division
The partial remainder s(j) in the division recurrence isn't the true remainder but a version scaled by 2^j
Division with left shifts: s(j) = 2s(j–1) – qk–j (2^k d), with s(0) = z and s(k) = 2^k s (shift, then subtract)
Quotient digit selection keeps the scaled remainder bounded (say, in the range –d to d) to ensure the convergence of the true remainder to 0
May 2012 Computer Arithmetic, Division

371 Recurrence Formulas for Convergence Methods
u (i+1) = f(u (i), v (i)) v (i+1) = g(u (i), v (i)) u (i+1) = f(u (i), v (i), w (i)) v (i+1) = g(u (i), v (i), w (i)) w (i+1) = h(u (i), v (i), w (i)) Constant Desired function Guide the iteration such that one of the values converges to a constant (usually 0 or 1) The other value then converges to the desired function The complexity of this method depends on two factors: a. Ease of evaluating f and g (and h) b. Rate of convergence (number of iterations needed) May 2012 Computer Arithmetic, Division

372 16.2 Division by Repeated Multiplications
Motivation: Suppose add takes 1 clock and multiply 3 clocks; a 64-bit divide takes 64 clocks in radix 2, 32 in radix 4 ⇒ dividing via multiplications is faster if 10 or fewer are needed
Idea: q = z/d = z x(0) x(1) . . . x(m–1) / (d x(0) x(1) . . . x(m–1)); force the denominator to converge to 1, so the numerator converges to q
Remainder often not needed, but can be obtained by another multiplication if desired: s = z – qd
To turn the identity into a division algorithm, we face three questions:
1. How to select the multipliers x(i)?
2. How many iterations (pairs of multiplications)?
3. How to implement in hardware?
May 2012 Computer Arithmetic, Division

373 Formulation as a Convergence Computation
d(i+1) = d(i) x(i): set d(0) = d; make d(m) converge to 1
z(i+1) = z(i) x(i): set z(0) = z; obtain z/d = q ≈ z(m)
Question 1: How to select the multipliers x(i)? Choose x(i) = 2 – d(i)
This choice transforms the recurrence equations into:
d(i+1) = d(i) (2 – d(i)): set d(0) = d; iterate until d(m) ≈ 1
z(i+1) = z(i) (2 – d(i)): set z(0) = z; obtain z/d = q ≈ z(m)
This fits the general form u(i+1) = f(u(i), v(i)); v(i+1) = g(u(i), v(i))
May 2012 Computer Arithmetic, Division
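The two coupled recurrences in a few lines of Python (a sketch; floats stand in for the k-bit registers):

def converge_divide(z, d, m=6):     # assumes 1/2 <= d < 1
    for _ in range(m):
        x = 2 - d                   # x(i) = 2 - d(i): a 2's complementation in hardware
        d *= x                      # d(i+1) = d(i) x(i), driven toward 1
        z *= x                      # z(i+1) = z(i) x(i), driven toward q
    return z

print(converge_divide(0.7, 0.8))    # ~0.875 = 0.7 / 0.8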

374 Determining the Rate of Convergence
d(i+1) = d(i) x(i): set d(0) = d; make d(m) converge to 1
z(i+1) = z(i) x(i): set z(0) = z; obtain z/d = q ≈ z(m)
Question 2: How quickly does d(i) converge to 1?
We can relate the error in step i + 1 to the error in step i:
d(i+1) = d(i) (2 – d(i)) = 1 – (1 – d(i))^2, so 1 – d(i+1) = (1 – d(i))^2
For 1 – d(i) ≤ ε, we get 1 – d(i+1) ≤ ε^2: quadratic convergence
In general, for k-bit operands, we need 2m – 1 multiplications and m 2's complementations, where m = ⌈log2 k⌉
May 2012 Computer Arithmetic, Division

375 Quadratic Convergence
Table: Quadratic convergence in computing z/d by repeated multiplications, where 1/2 ≤ d = 1 – y < 1
i   d(i) = d(i–1) x(i–1), with d(0) = d            x(i) = 2 – d(i)
0   1 – y = (.1xxx xxxx xxxx xxxx)two ≥ 1/2         1 + y
1   1 – y^2 = (.11xx xxxx xxxx xxxx)two ≥ 3/4       1 + y^2
2   1 – y^4 = (.1111 xxxx xxxx xxxx)two ≥ 15/16     1 + y^4
3   1 – y^8 = (.1111 1111 xxxx xxxx)two ≥ 255/256   1 + y^8
4   1 – y^16 = (.1111 1111 1111 1111)two = 1 – ulp
Each iteration doubles the number of guaranteed leading 1s (convergence to 1 is from below)
Beginning with a single 1 (d ≥ ½), after ⌈log2 k⌉ iterations we get as close to 1 as is possible in a fractional representation
May 2012 Computer Arithmetic, Division

376 Graphical Depiction of Convergence to q
Question 3 (implementation in hardware) to be discussed later Fig Graphical representation of convergence in division by repeated multiplications. May 2012 Computer Arithmetic, Division

377 16.3 Division by Reciprocation
The Newton-Raphson method can be used for finding a root of f(x) = 0
Start with an initial estimate x(0) for the root
Iteratively refine the estimate via the recurrence x(i+1) = x(i) – f(x(i)) / f ′(x(i))
Justification: tan α(i) = f ′(x(i)) = f(x(i)) / (x(i) – x(i+1))
Fig. Convergence to a root of f(x) = 0 in the Newton-Raphson method. May 2012 Computer Arithmetic, Division

378 Computing 1/d by Convergence
1/d is the root of f(x) = 1/x – d, with f ′(x) = –1/x^2
Substitute in the Newton-Raphson recurrence x(i+1) = x(i) – f(x(i)) / f ′(x(i)) to get: x(i+1) = x(i) (2 – x(i) d)
One iteration = two multiplications + one 2's complementation
Error analysis: Let δ(i) = 1/d – x(i) be the error at the ith iteration
δ(i+1) = 1/d – x(i+1) = 1/d – x(i) (2 – x(i) d) = d (1/d – x(i))^2 = d (δ(i))^2
Because d < 1, we have δ(i+1) < (δ(i))^2
May 2012 Computer Arithmetic, Division

379 Choosing the Initial Approximation to 1/d
With x(0) in the range 0 < x(0) < 2/d, convergence is guaranteed
Justification: |δ(0)| = |x(0) – 1/d| < 1/d, so δ(1) = |x(1) – 1/d| = d (δ(0))^2 = (d δ(0)) δ(0) < δ(0)
For d in [1/2, 1): simple choice x(0) = 1.5, max error = 0.5 < 1/d
Better approximation: x(0) = 4(√3 – 1) – 2d ≈ 2.9282 – 2d, max error ≈ 0.1
May 2012 Computer Arithmetic, Division
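Putting the recurrence and the better initial approximation together (a Python sketch of mine):

import math

def reciprocal(d, iterations=5):            # d in [1/2, 1)
    x = 4 * (math.sqrt(3) - 1) - 2 * d      # initial error at most ~0.1
    for _ in range(iterations):
        x = x * (2 - x * d)                 # error is squared (and scaled by d) each step
    return x

print(reciprocal(0.75), 1 / 0.75)           # both ~1.3333333333333333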

380 16.4 Speedup of Convergence Division
Compute y = 1/d Do the multiplication yz Division can be performed via 2 log2 k – 1 multiplications This is not yet very impressive 64-bit numbers, 3-ns multiplier  33-ns division Three types of speedup are possible: Fewer multiplications (reduce m) Narrower multiplications (reduce the width of some x(i)s) Faster multiplications May 2012 Computer Arithmetic, Division

381 Initial Approximation via Table Lookup
Convergence is slow in the beginning: it takes 6 multiplications to get 8 bits of convergence and another 5 to go from 8 bits to 64 bits
Read the initial value x(0+) directly from a table, thereby reducing 6 multiplications to 2
A 2^w × w lookup table is necessary and sufficient for w bits of convergence after 2 multiplications
Example with 4-bit lookup: d = (0.1011 xxxx)two (11/16 ≤ d < 12/16)
Inverses of the two extremes are 16/11 ≈ 1.4545 and 16/12 ≈ 1.3333
So, 11/8 = (1.011)two is a good estimate for 1/d:
(11/8) × (11/16) = 121/128 = 0.9453; (11/8) × (3/4) = 33/32 = 1.03125
May 2012 Computer Arithmetic, Division

382 Visualizing the Convergence with Table Lookup
Fig Convergence in division by repeated multiplications with initial table lookup. May 2012 Computer Arithmetic, Division

383 Convergence Does Not Have to Be from Below
Fig Convergence in division by repeated multiplications with initial table lookup and the use of truncated multiplicative factors. May 2012 Computer Arithmetic, Division

384 Using Truncated Multiplicative Factors
A truncated denominator d(i), with a identical leading bits and b extra bits (b ≤ a), leads to a new denominator d(i+1) with a + b identical leading bits (Problem 16.9a)
Fig. One step in convergence division with truncated multiplicative factors.
Example (64-bit multiplication): Initial step: table of size 256 × 8 = 2K bits; middle steps: multiplication pairs with 9-, 17-, and 33-bit multipliers; final step: full 64 × 64 multiplication
May 2012 Computer Arithmetic, Division

385 16.5 Hardware Implementation
Repeated multiplications: Each pair of ops involves the same multiplier d (i+1) = d (i) (2 - d (i)) Set d (0) = d; iterate until d (m)  1 z (i+1) = z (i) (2 - d (i)) Set z (0) = z; obtain z/d = q  z (m) Fig Two multiplications fully overlapped in a 2-stage pipelined multiplier. May 2012 Computer Arithmetic, Division

386 Implementing Division with Reciprocation
Reciprocation: Multiplication pairs are data-dependent, so they cannot be pipelined or performed in parallel x (i+1) = x (i) (2 - x (i)d) Options for speedup via a better initial approximation Consult a larger table Resort to a bipartite or multipartite table (see Chapter 24) Use table lookup, followed with interpolation Compute the approximation via multioperand addition Unless several multiplications by the same multiplier are needed, division by repeated multiplications is more efficient However, given a fast method for reciprocation (see Section 24.6), using a reciprocation unit with a standard multiplier is often preferred May 2012 Computer Arithmetic, Division

387 16.6 Analysis of Lookup Table Size
Table: Sample entries in the lookup table replacing the first four multiplications in division by repeated multiplications
Address 55: d = 0.1 0011 0111 x xxx, x(0+) = 1.1010 0100 or 1.1010 0101
Example: Table entry at address 55 (311/512 ≤ d < 312/512)
For 8 bits of convergence, the table entry f must satisfy (311/512)(1 + .f) ≥ 1 – 2^–8 and (312/512)(1 + .f) ≤ 1 + 2^–8
199/311 ≤ .f ≤ 101/156, or 163.81 ≤ 256 × .f ≤ 165.74
Two choices: 164 = (1010 0100)two or 165 = (1010 0101)two
May 2012 Computer Arithmetic, Division
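The bounds on the table entry can be checked numerically (a quick sketch):

lo = 256 * ((1 - 2**-8) * 512 / 311 - 1)   # from (311/512)(1 + .f) >= 1 - 2^-8
hi = 256 * ((1 + 2**-8) * 512 / 312 - 1)   # from (312/512)(1 + .f) <= 1 + 2^-8
print(round(lo, 2), round(hi, 2))          # 163.81 165.74: only 164 and 165 qualify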

388 A General Result for Table Size
Theorem 16.1: To get w ≥ 5 bits of convergence after the first iteration of division by repeated multiplications, w bits of d (beyond the mandatory 1) must be inspected. The factor x(0+) read out from the table is of the form (1.xxx . . . xxx)two, with w bits after the radix point
Proof strategy for sufficiency: Represent the table entry 1.f as the integer v = 2^w × .f and derive upper/lower bound expressions for it. Then, show that at least one integer exists between vlb and vub
Proof strategy for necessity: Show that the derived conditions cannot be met if the table is of size 2^(w–1) (no matter how wide) or if it is of width w – 1 (no matter how large)
Excluded cases, w < 5: Practically uninteresting (allow smaller table)
General radix r: Same analysis method, and results, apply
May 2012 Computer Arithmetic, Division


390 Part V Real Arithmetic May 2012 Computer Arithmetic, Real Arithmetic
28. Reconfigurable Arithmetic Appendix: Past, Present, and Future May 2012 Computer Arithmetic, Real Arithmetic


392 Computer Arithmetic, Real Arithmetic
V Real Arithmetic Review floating-point numbers, arithmetic, and errors: How to combine wide range with high precision Format and arithmetic ops; the IEEE standard Causes and consequence of computation errors When can we trust computation results? Topics in This Part Chapter 17 Floating-Point Representations Chapter 18 Floating-Point Operations Chapter 19 Errors and Error Control Chapter 20 Precise and Certifiable Arithmetic May 2012 Computer Arithmetic, Real Arithmetic

393 “According to my calculation, you should float now ... I think ...”
“It’s an inexact science.” May 2012 Computer Arithmetic, Real Arithmetic

394 17 Floating-Point Representations
Chapter Goals Study a representation method offering both wide range (e.g., astronomical distances) and high precision (e.g., atomic distances) Chapter Highlights Floating-point formats and related tradeoffs The need for a floating-point standard Finiteness of precision and range Fixed-point and logarithmic representations as special cases at the two extremes May 2012 Computer Arithmetic, Real Arithmetic

395 Floating-Point Representations: Topics
Topics in This Chapter 17.1 Floating-Point Numbers 17.2 The IEEE Floating-Point Standard 17.3 Basic Floating-Point Algorithms 17.4 Conversions and Exceptions 17.5 Rounding Schemes 17.6 Logarithmic Number Systems May 2012 Computer Arithmetic, Real Arithmetic

396 17.1 Floating-Point Numbers
No finite number system can represent all real numbers
Various systems can be used for a subset of real numbers:
Fixed-point, ±w.f: low precision and/or range
Rational, ±p/q: difficult arithmetic
Floating-point, ±s × b^e: most common scheme
Logarithmic, ±log_b x: limiting case of floating-point
Fixed-point numbers: x = (0000 0000 . 0000 1001)two, a small number; y = (1001 0000 . 0000 0000)two, a large number; the square of neither number is representable in the same format
Floating-point numbers: x = ±s × b^e, or ±significand × base^exponent, e.g., x = 1.001 × 2^–5, y = 1.001 × 2^+7
A floating-point number comes with two signs: number sign, usually appearing as a separate bit; exponent sign, usually embedded in the biased exponent
May 2012 Computer Arithmetic, Real Arithmetic

397 Floating-Point Number Format and Distribution
Fig. Typical floating-point number format.
Fig. Subranges and special values in floating-point number representations. (Example values: 1.001 × 2^–5, 1.001 × 2^+7)
May 2012 Computer Arithmetic, Real Arithmetic

398 Floating-Point Before the IEEE Standard
Computer manufacturers tended to have their own hardware-level formats This created many problems, as floating-point computations could produce vastly different results (not just differing in the last few significant bits) To get a sense for the wide variations in floating-point formats, visit: In computer arithmetic, we talked about IBM, CDC, DEC, Cray, … formats and discussed their relative merits First IEEE standard for binary floating-point arithmetic was adopted in 1985 after years of discussion The 1985 standard was continuously discussed, criticized, and clarified for a couple of decades In 2008, after several years of discussion, a revised standard was issued May 2012 Computer Arithmetic, Real Arithmetic

399 17.2 The IEEE Floating-Point Standard
IEEE Standard 754-2008 (supersedes IEEE Standard 754-1985)
Also includes half- & quad-word binary, plus some decimal formats
Fig. The IEEE standard floating-point number representation formats. May 2012 Computer Arithmetic, Real Arithmetic

400 Overview of IEEE 754-2008 Standard Formats
Table: Some features of the IEEE 754-2008 standard floating-point number representation formats (Single/Short vs. Double/Long)
Word width (bits): 32 / 64
Significand bits: 23 + 1 hidden / 52 + 1 hidden
Significand range: [1, 2 – 2^–23] / [1, 2 – 2^–52]
Exponent bits: 8 / 11
Exponent bias: 127 / 1023
Zero (±0): e + bias = 0, f = 0 (both formats)
Denormal: e + bias = 0, f ≠ 0; represents ±0.f × 2^–126 / ±0.f × 2^–1022
Infinity (±∞): e + bias = 255, f = 0 / e + bias = 2047, f = 0
Not-a-number (NaN): e + bias = 255, f ≠ 0 / e + bias = 2047, f ≠ 0
Ordinary number: e + bias ∈ [1, 254], e ∈ [–126, 127] / e + bias ∈ [1, 2046], e ∈ [–1022, 1023]; represents ±1.f × 2^e
min: 2^–126 ≈ 1.2 × 10^–38 / 2^–1022 ≈ 2.2 × 10^–308
max: ≈ 3.4 × 10^38 / ≈ 1.8 × 10^308
May 2012 Computer Arithmetic, Real Arithmetic

401 Computer Arithmetic, Real Arithmetic
Exponent Encoding
Exponent encoding in 8 bits for the single/short (32-bit) IEEE 754 format:
Decimal code: 0, 1, ..., 126, 127, 128, ..., 254, 255
Hex code: 00, 01, ..., 7E, 7F, 80, ..., FE, FF
Exponent value: (special), –126, ..., –1, 0, +1, ..., +127, (special)
Code 0 with f = 0: representation of ±0; code 0 with f ≠ 0: representation of subnormals, ±0.f × 2^–126
Code 255 with f = 0: representation of ±∞; code 255 with f ≠ 0: representation of NaNs
Ordinary numbers: ±1.f × 2^e
Exponent encoding in 11 bits for the double/long (64-bit) format is similar
May 2012 Computer Arithmetic, Real Arithmetic
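The encoding can be explored with Python's struct module (a sketch; the field and kind names are mine):

import struct

def decode_single(x):
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31
    biased = (bits >> 23) & 0xFF            # 8-bit biased exponent
    f = bits & 0x7FFFFF                     # 23-bit fraction
    if biased == 0:
        kind = 'zero' if f == 0 else 'subnormal'
    elif biased == 255:
        kind = 'infinity' if f == 0 else 'NaN'
    else:
        kind = 'ordinary'                   # value = (-1)^sign * 1.f * 2^(biased - 127)
    return sign, biased - 127, f, kind

print(decode_single(-6.5))                  # (1, 2, 5242880, 'ordinary'): -1.625 * 2^2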

402 Special Operands and Subnormals
Biased exponent values 0 and 255 are reserved: 0 encodes ±0 and subnormals (±0.f × 2^–126); 255 encodes ±∞ and NaNs; biased values 1 through 254 encode ordinary numbers ±1.f × 2^e
Operations on special operands: Ordinary number ÷ (+∞) = ±0; (+∞) × Ordinary number = ±∞; NaN + Ordinary number = NaN
Spacing of subnormals: (1.00…01 – 1.00…00) × 2^–126 = 2^–149
Fig. Subnormals in the IEEE single-precision format. May 2012 Computer Arithmetic, Real Arithmetic

403 Computer Arithmetic, Real Arithmetic
Extended Formats
Bias is unspecified, but the exponent range must include:
Single extended: [–1022, 1023], with ≥ 11 exponent bits and ≥ 32 significand bits
Double extended: [–16382, 16383], with ≥ 15 exponent bits and ≥ 64 significand bits
May 2012 Computer Arithmetic, Real Arithmetic

404 Requirements for Arithmetic
Results of the 4 basic arithmetic operations (+, –, ×, ÷) as well as square-rooting must match those obtained if all intermediate computations were infinitely precise
That is, a floating-point arithmetic operation should introduce no more imprecision than the error attributable to the final rounding of a result that has no exact representation (this is the best possible)
Example: the exact product of two operands, when rounded to the nearest representable value, incurs an error of at most ½ ulp
May 2012 Computer Arithmetic, Real Arithmetic

405 17.3 Basic Floating-Point Algorithms
Addition: Assume e1 ≥ e2; an alignment shift (preshift) is needed if e1 > e2:
(±s1 × b^e1) + (±s2 × b^e2) = (±s1 × b^e1) + (±s2 / b^(e1–e2)) × b^e1 = (±s1 ± s2 / b^(e1–e2)) × b^e1 = ±s × b^e
Rounding, overflow, and underflow issues discussed later
May 2012 Computer Arithmetic, Real Arithmetic
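The alignment and normalization steps, sketched with Python floats standing in for significands (my illustration, not the book's design):

def fp_add(s1, e1, s2, e2, b=2):            # operands are +-s * b^e, 1 <= |s| < b
    if e1 < e2:                             # ensure e1 >= e2
        s1, e1, s2, e2 = s2, e2, s1, e1
    s, e = s1 + s2 / b ** (e1 - e2), e1     # preshift s2, then add
    while abs(s) >= b:                      # normalizing right shift
        s, e = s / b, e + 1
    while s != 0 and abs(s) < 1:            # left shifts, needed after cancellation
        s, e = s * b, e - 1
    return s, e

print(fp_add(1.25, 5, 1.5, 3))              # (1.625, 5): 1.5 * 2^3 aligns to 0.375 * 2^5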

406 Floating-Point Multiplication and Division
Multiplication: (±s1 × b^e1) × (±s2 × b^e2) = ±(s1 × s2) × b^(e1+e2)
Because s1 × s2 ∈ [1, 4), postshifting may be needed for normalization; overflow or underflow can occur during multiplication or normalization
Division: (±s1 × b^e1) / (±s2 × b^e2) = ±(s1 / s2) × b^(e1–e2)
Because s1 / s2 ∈ (0.5, 2), postshifting may be needed for normalization; overflow or underflow can occur during division or normalization
May 2012 Computer Arithmetic, Real Arithmetic

407 Floating-Point Square-Rooting
For e even: √(s × b^e) = √s × b^(e/2)
For e odd: √(bs × b^(e–1)) = √(bs) × b^((e–1)/2)
After the adjustment of s to bs and e to e – 1, if needed, we have: √(s* × b^e*) = √(s*) × b^(e*/2), with e* even
s* is in [1, 4) for IEEE 754, so √(s*) is in [1, 2)
Overflow or underflow is impossible; no postnormalization needed
May 2012 Computer Arithmetic, Real Arithmetic

408 17.4 Conversions and Exceptions
Conversions from fixed- to floating-point Conversions between floating-point formats Conversion from high to lower precision: Rounding The IEEE standard includes five rounding modes: Round to nearest, ties away from 0 (rtna) Round to nearest, ties to even (rtne) [default rounding mode] Round toward zero (inward) Round toward + (upward) Round toward – (downward) May 2012 Computer Arithmetic, Real Arithmetic

409 Exceptions in Floating-Point Arithmetic
Divide by zero; overflow; underflow
Inexact result: rounded value not the same as original
Invalid operation (produces NaN as result): examples include
Addition: (+∞) + (–∞)
Multiplication: 0 × ∞
Division: 0/0 or ∞/∞
Square-rooting: operand < 0
May 2012 Computer Arithmetic, Real Arithmetic

410 Computer Arithmetic, Real Arithmetic
17.5 Rounding Schemes
Rounding maps a value with whole and fractional parts to an integer result (unit: ulp):
xk–1xk–2 . . . x1x0 . x–1x–2 . . . x–l  →  yk–1yk–2 . . . y1y0
The simplest possible rounding scheme: chopping or truncation
xk–1xk–2 . . . x1x0 . x–1x–2 . . . x–l  →  xk–1xk–2 . . . x1x0 (fractional digits dropped)
May 2012 Computer Arithmetic, Real Arithmetic

411 Truncation or Chopping
Fig Truncation or chopping of a 2’s-complement number (same as downward-directed rounding). Fig Truncation or chopping of a signed-magnitude number (same as round toward 0). May 2012 Computer Arithmetic, Real Arithmetic

412 Round to Nearest Number
rtna(x): Rounding has a slight upward bias. Consider rounding (xk–1xk–2 ... x1x0 . x–1x–2)two to an integer (yk–1yk–2 ... y1y0 .)two
The four possible cases, and their representation errors, are:
x–1x–2 = 00: round down, error 0
x–1x–2 = 01: round down, error –0.25
x–1x–2 = 10: round up, error +0.5
x–1x–2 = 11: round up, error +0.25
With equal probabilities, mean error = +0.125
For certain calculations, the probability of getting a midpoint value can be much higher than 2^–l
Fig. Rounding of a signed-magnitude value to the nearest number. May 2012 Computer Arithmetic, Real Arithmetic

413 Round to Nearest Even Number
Fig R* rounding or rounding to the nearest odd number. Fig Rounding to the nearest even number. May 2012 Computer Arithmetic, Real Arithmetic
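Round-to-nearest-even for a nonnegative value, keeping l fractional bits (a sketch of mine):

import math

def round_to_nearest_even(x, l):
    scaled = x * 2 ** l
    f = math.floor(scaled)
    half = scaled - f
    if half > 0.5 or (half == 0.5 and f & 1):   # ties go to the even neighbor
        f += 1
    return f / 2 ** l

print(round_to_nearest_even(0.75, 1), round_to_nearest_even(0.25, 1))   # 1.0 0.0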

414 A Simple Symmetric Rounding Scheme
Chop and force the LSB of the result to 1
Simplicity of chopping, with the near-symmetry of ordinary rounding
Max error is comparable to chopping (double that of rounding)
Fig. Jamming or von Neumann rounding. May 2012 Computer Arithmetic, Real Arithmetic

415 Computer Arithmetic, Real Arithmetic
ROM Rounding
Fig. ROM rounding with an 8 × 2 table.
Example: Rounding with a 32 × 4 table
xk–1 . . . x4 x3x2x1x0 . x–1 . . . x–l  →  xk–1 . . . x4 y3y2y1y0 (ROM address: x3x2x1x0.x–1; ROM data: y3y2y1y0)
The rounding result is the same as that of the round-to-nearest scheme in 31 of the 32 possible cases, but a larger error is introduced when x3 = x2 = x1 = x0 = x–1 = 1
May 2012 Computer Arithmetic, Real Arithmetic

416 Directed Rounding: Motivation
We may need result errors to be in a known direction Example: in computing upper bounds, larger results are acceptable, but results that are smaller than correct values could invalidate the upper bound This leads to the definition of directed rounding modes upward-directed rounding (round toward +) and downward-directed rounding (round toward –) (required features of IEEE floating-point standard) May 2012 Computer Arithmetic, Real Arithmetic

417 Directed Rounding: Visualization
Fig Truncation or chopping of a 2’s-complement number (same as downward-directed rounding). Fig Upward-directed rounding or rounding toward +. May 2012 Computer Arithmetic, Real Arithmetic

418 17.6 Logarithmic Number Systems
Sign-and-logarithm number system: the limiting case of FLP representation x = ±b^e, with e = log_b |x|
We usually call b the logarithm base, not the exponent base
Using an integer-valued e wouldn't be very useful, so we consider e to be a fixed-point number
Fig. Logarithmic number representation with sign and fixed-point exponent. May 2012 Computer Arithmetic, Real Arithmetic

419 Properties of Logarithmic Representation
The logarithm is often represented as a 2's-complement number: (Sx, Lx) = (sign(x), log2 |x|)
Simple multiplication and division; harder addition and subtraction:
L(xy) = Lx + Ly; L(x/y) = Lx – Ly
Example: 12-bit, base-2, logarithmic number system (sign bit, then a fixed-point log with its own sign and radix point); a string with Sx = 1 and Lx ≈ –9.828 represents –2^–9.828 ≈ –(0.0011)ten
Number range ≈ (–2^16, 2^16); min = 2^–16
May 2012 Computer Arithmetic, Real Arithmetic

420 Advantages of Logarithmic Representation
Fig Some of the possible ways of assigning 16 distinct codes to represent numbers. May 2012 Computer Arithmetic, Real Arithmetic

421 18 Floating-Point Operations
Chapter Goals See how adders, multipliers, and dividers are designed for floating-point operands (square-rooting postponed to Chapter 21) Chapter Highlights Floating-point operation = preprocessing + exponent and significand arithmetic + postprocessing (+ exception handling) Adders need preshift, postshift, rounding Multipliers and dividers are easy to design May 2012 Computer Arithmetic, Real Arithmetic

422 Floating-Point Operations: Topics
Topics in This Chapter 18.1 Floating-Point Adders / Subtractors 18.2 Pre- and Postshifting 18.3 Rounding and Exceptions 18.4 Floating-Point Multipliers and Dividers 18.5 Fused-Multiply-Add Units 18.6 Logarithmic Arithmetic Units May 2012 Computer Arithmetic, Real Arithmetic

423 18.1 Floating-Point Adders/Subtractors
Floating-Point Addition Algorithm
Assume e1 ≥ e2; an alignment shift (preshift) is needed if e1 > e2:
(±s1 × b^e1) + (±s2 × b^e2) = (±s1 ± s2 / b^(e1–e2)) × b^e1 = ±s × b^e
Like signs: possible 1-position normalizing right shift
Different signs: left shift, possibly by many positions
Overflow/underflow can occur during addition or normalization
May 2012 Computer Arithmetic, Real Arithmetic

424 Computer Arithmetic, Real Arithmetic
FLP Addition Hardware Isolate the sign, exponent, significand Reinstate the hidden 1 Convert operands to internal format Identify special operands, exceptions Fig Block diagram of a floating-point adder/subtractor. Other key parts of the adder: Significand aligner (preshifter): Sec. 18.2 Result normalizer (postshifter), including leading 0s detector/predictor: Sec. 18.2 Rounding unit: Sec. 18.3 Sign logic: Problem 18.2 Converting internal to external representation, if required, must be done at the rounding stage Combine sign, exponent, significand Hide (remove) the leading 1 Identify special outcomes, exceptions May 2012 Computer Arithmetic, Real Arithmetic

425 Computer Arithmetic, Real Arithmetic
18.2 Pre- and Postshifting Fig One bit-slice of a single-stage pre-shifter. Fig. 18.3 Four-stage combinational shifter for preshifting an operand by 0 to 15 bits. May 2012 Computer Arithmetic, Real Arithmetic

426 Leading Zeros / Ones Detection or Prediction
Leading zeros prediction, with adder inputs (0x0.x–1x–2 ...)2’s-compl and (0y0.y–1y–2 ...)2’s-compl Ways in which leading 0s/1s are generated: p p p p g a a a a g p p p p g a a a a p p p p p a g g g g a p p p p a g g g g p Prediction might be done in two stages:  Coarse estimate, used for coarse shift  Fine tuning of estimate, used for fine shift In this way, prediction can be partially overlapped with shifting Fig Leading zeros/ones counting versus prediction. May 2012 Computer Arithmetic, Real Arithmetic

427 18.3 Rounding and Exceptions
Adder result = (cout z1z0 . z–1z–2 . . . z–l G R S)2's-compl
Guard bit, round bit, sticky bit (S = OR of all bits shifted past R); why only 3 extra bits?
Amount of alignment right-shift of one bit: G holds the bit that is shifted out, no precision is lost
Two bits or more: the shifted significand has a magnitude in [0, 1/2) and the unshifted significand a magnitude in [1, 2), so the difference of aligned significands has a magnitude in (1/2, 2)
Normalization left-shift will thus be by at most one bit: magnitude in (1/2, 1), shift left; magnitude in [1, 2), no shift
If a normalization left-shift actually takes place: R = 0, round down, discarded part < ulp/2; R = 1, round up, discarded part ≥ ulp/2
The only remaining question is establishing whether the discarded part is exactly ulp/2 (for round to nearest even); S provides this information
May 2012 Computer Arithmetic, Real Arithmetic

428 Floating-Point Adder with Dual Data Paths
Amount of alignment right-shift One bit: Arbitrary left shift may be needed due to cancellation Two bits or more: Normalization left-shift will be by at most one bit Control Fig Conceptual view of significand handling in a dual-path floating-point adder. 2 or more bits preshift May 2012 Computer Arithmetic, Real Arithmetic

429 Implementation of Rounding for Addition
The effect of 1-bit normalization shifts on the rightmost few bits of the significand adder output is as follows:
Before postshifting (z): . . . z–l+1 z–l | G R S
1-bit normalizing right-shift: . . . z–l+2 z–l+1 | z–l G (R ∨ S)
1-bit normalizing left-shift: . . . z–l G | R S 0
After normalization (Z): . . . Z–l+1 Z–l | Z–l–1 Z–l–2 Z–l–3
Note that no rounding is needed in case of a multibit left-shift, because full precision is preserved in this case
Round to nearest even: Do nothing if Z–l–1 = 0 or Z–l = Z–l–2 = Z–l–3 = 0; add ulp = 2^–l otherwise
May 2012 Computer Arithmetic, Real Arithmetic

430 Exceptions in Floating-Point Addition
Overflow/underflow detected by exponent adjustment block in Fig. 18.1 Overflow can occur only for normalizing right-shift Underflow possible only with normalizing left shifts Exceptions involving NaNs and invalid operations handled by unpacking and packing blocks in Fig. 18.1 Zero detection: Special case of leading 0s detection Determining when “inexact” exception must be signaled left as an exercise May 2012 Computer Arithmetic, Real Arithmetic

431 18.4 Floating-Point Multipliers and Dividers
( s1  b e1)  ( s2  b e2) = ( s1  s2 )  b e1+e2 s1  s2  [1, 4): may need postshifting Overflow or underflow can occur during multiplication or normalization Speed considerations Many multipliers produce the lower half of the product (rounding info) early Need for normalizing right-shift is known at or near the end Hence, rounding can be integrated in the generation of the upper half, by producing two versions of these bits Fig Block diagram of a floating-point multiplier (divider). May 2012 Computer Arithmetic, Real Arithmetic

432 Floating-Point Dividers
( s1  b e1) / ( s2  b e2) = ( s1 / s2 )  b e1-e2 s1 / s2  (0.5, 2): may need postshifting Overflow or underflow can occur during division or normalization Note: Square-rooting never leads to overflow or underflow Rounding considerations Quotient must be produced with two extra bits (G and R), in case of the need for a normalizing left shift The remainder acts as the sticky bit Fig Block diagram of a floating-point multiplier (divider). May 2012 Computer Arithmetic, Real Arithmetic

433 18.5 Fused-Multiply-Add Units
Multiply-add operation: p = ax + b
The most useful operation beyond the five basic ones
Application 1: Polynomial evaluation
f(z) = c(n–1) z^(n–1) + c(n–2) z^(n–2) + ... + c(1) z + c(0)
s := s·z + c(j) for j from n – 1 downto 0; initialize s to 0
Application 2: Dot-product computation
u · v = u(0) v(0) + u(1) v(1) + ... + u(n–1) v(n–1)
s := s + u(j) v(j) for j from 0 upto n – 1; initialize s to 0
Straightforward implementation: Use a multiplier that keeps its entire double-width product, followed by a double-width adder
May 2012 Computer Arithmetic, Real Arithmetic
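Both applications reduce to a chain of multiply-adds; the sketch below uses a placeholder fma that rounds twice (unlike true fused hardware, which rounds only once), so it is merely illustrative. A correctly rounded fused operation, where available (e.g., math.fma in recent Python releases), could replace the stand-in:

    def fma(a, x, b):                 # stand-in; real FMA rounds only once
        return a * x + b

    def poly_eval(c, z):              # Horner: s := s*z + c[j], j = n-1 downto 0
        s = 0.0
        for cj in reversed(c):        # c = [c0, c1, ..., c(n-1)]
            s = fma(s, z, cj)
        return s

    def dot_product(u, v):            # s := s + u[j]*v[j], j = 0 upto n-1
        s = 0.0
        for uj, vj in zip(u, v):
            s = fma(uj, vj, s)
        return s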

434 Design of a Fast FMA Unit
Multiply-add operation: p = ax + b
The unit can act as a simple adder (x = 1) or multiplier (b = 0)
Significands sa and sx drive a multiples-formation block and a carry-save adder tree, producing ax in stored-carry form
Significand sb is alignment-preshifted by ea + ex – eb; the preshift may be to the right or to the left
A carry-save adder merges sb with ax; the final adder, aided by leading 0s/1s prediction, feeds normalization and rounding
Three optimizations (marked on the figure) speed up this data path
Fig. Block diagram of a fast FMA unit.
May 2012 Computer Arithmetic, Real Arithmetic

435 18.6 Logarithmic Arithmetic Unit
Multiply/divide algorithm in LNS:
log(x × y) = log x + log y
log(x / y) = log x – log y
Add/subtract algorithm in LNS: (Sx, Lx) ± (Sy, Ly) = (Sz, Lz)
Assume x > y > 0 (other cases are similar)
Lz = log z = log(x ± y) = log(x (1 ± y/x)) = log x + log(1 ± y/x)
Given Δ = –(log x – log y), the term log(1 ± y/x) = log(1 ± log^–1 Δ) is obtained from a table (two tables, φ+ and φ–, are needed):
log(x + y) = log x + φ+(Δ)
log(x – y) = log x + φ–(Δ)
May 2012 Computer Arithmetic, Real Arithmetic
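As a quick illustration with base-2 logs, the φ+ and φ– corrections can be computed rather than tabulated; the function names and the base are my choices here, not the notation of a particular hardware design:

    import math

    def phi_plus(delta):              # delta = Ly - Lx <= 0
        return math.log2(1 + 2**delta)

    def phi_minus(delta):             # delta = Ly - Lx < 0
        return math.log2(1 - 2**delta)

    def lns_add(Lx, Ly):              # returns log2(x + y), for x, y > 0
        if Ly > Lx:
            Lx, Ly = Ly, Lx           # ensure x >= y
        return Lx + phi_plus(Ly - Lx)

    def lns_sub(Lx, Ly):              # returns log2(x - y), for x > y > 0
        return Lx + phi_minus(Ly - Lx)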

436 Four-Function Logarithmic Arithmetic Unit
Four-function logarithmic arithmetic unit:
log(x × y) = log x + log y
log(x / y) = log x – log y
log(x + y) = log x + φ+(Δ)
log(x – y) = log x + φ–(Δ)
Inputs (Sx, Lx) and (Sy, Ly) pass through muxes and a comparison (Lx > Ly?) to address a ROM holding the φ+ and φ– tables; two add/subtract units, under op control, form Δ and the result (Sz, Lz)
Lm is the log of the scale factor m, which allows values in [0, 1] to be represented as unsigned logs
Fig. Arithmetic unit for a logarithmic number system.
May 2012 Computer Arithmetic, Real Arithmetic

437 LNS Arithmetic for Wider Words
log(x + y) = log x + φ+(Δ)
log(x – y) = log x + φ–(Δ)
φ+ is well-behaved and easy to interpolate
φ– causes difficulties as Δ approaches 0, i.e., in [–1, 0]
Use nonuniform segmentation for direct table lookup or for a scheme based on linear interpolation:
10xxx.xxxxxxx
110xx.xxxxxxx
1110x.xxxxxxx
11110.xxxxxxx
. . .
May 2012 Computer Arithmetic, Real Arithmetic

438 19 Errors and Error Control
Chapter Goals
Learn about sources of computation errors, consequences of inexact arithmetic, and methods for avoiding or limiting errors
Chapter Highlights
Representation and computation errors
Absolute versus relative error
Worst-case versus average error
Why 3 × (1/3) does not necessarily yield 1
Error analysis and bounding
May 2012 Computer Arithmetic, Real Arithmetic

439 Errors and Error Control: Topics
Topics in This Chapter 19.1 Sources of Computational Errors 19.2 Invalidated Laws of Algebra 19.3 Worst-Case Error Accumulation 19.4 Error Distribution and Expected Errors 19.5 Forward Error Analysis 19.6 Backward Error Analysis May 2012 Computer Arithmetic, Real Arithmetic

440 19.1 Sources of Computational Errors
FLP approximates exact computation with real numbers
Two sources of errors to understand and counteract:
Representation errors
e.g., no machine representation for 1/3, √2, or π
Arithmetic errors
e.g., (1 + 2^–12)² = 1 + 2^–11 + 2^–24 is not representable exactly in IEEE 754 short format
We saw early in the course that errors due to finite precision can lead to disasters in life-critical applications
May 2012 Computer Arithmetic, Real Arithmetic

441 Example Showing Representation and Arithmetic Errors
Example 19.1: Compute 1/99 – 1/100, using a decimal floating-point format with a 4-digit significand in [1, 10) and a single-digit signed exponent
Precise result = 1/9900 ≈ 1.0101 × 10^–4 (error ≈ 10^–8 or 0.01%)
Chopped to 3 decimals:
x = 1/99 ≈ 1.010 × 10^–2    Error ≈ 10^–6 or 0.01%
y = 1/100 = 1.000 × 10^–2    Error = 0
z = x –fp y = 1.010 × 10^–2 – 1.000 × 10^–2 = 1.000 × 10^–4    Error ≈ 10^–6 or 1%
May 2012 Computer Arithmetic, Real Arithmetic
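Example 19.1 can be replayed with Python's decimal module, using a 4-digit context with chopping; this is just a demonstration aid, not part of the original example:

    from decimal import Decimal, getcontext, ROUND_DOWN

    getcontext().prec = 4                  # 4-digit significand
    getcontext().rounding = ROUND_DOWN     # chopping

    x = Decimal(1) / Decimal(99)           # 0.01010 (representation error ~ 1e-6)
    y = Decimal(1) / Decimal(100)          # 0.01000 (exact)
    print(x - y)                           # 0.00010, vs. precise 1/9900 ~ 0.00010101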

442 Notation for a General Floating-Point System
Number representation in FLP(r, p, A):
Radix r (assumed to be the same as the exponent base b)
Precision p in terms of radix-r digits
Approximation scheme A ∈ {chop, round, rtne, chop(g), ...}
Let x = r^e · s be an unsigned real number, normalized such that 1/r ≤ s < 1, and assume xfp is the representation of x in FLP(r, p, A)
xfp = r^e · sfp = (1 + h)x, where h is the relative representation error
A = chop:    –ulp < sfp – s ≤ 0    so    –r·ulp < h ≤ 0
A = round:    –ulp/2 < sfp – s ≤ ulp/2    so    |h| ≤ r·ulp/2
Arithmetic in FLP(r, p, A):
Obtain an infinite-precision result, then chop, round, . . .
Real machines approximate this process by keeping g > 0 guard digits, thus doing arithmetic in FLP(r, p, chop(g))
May 2012 Computer Arithmetic, Real Arithmetic

443 Error Analysis for Multiplication and Division
Errors in floating-point multiplication: consider positive operands xfp = (1 + s)x and yfp = (1 + t)y
xfp ×fp yfp = (1 + h) xfp yfp = (1 + h)(1 + s)(1 + t) xy
= (1 + h + s + t + hs + ht + st + hst) xy
≈ (1 + h + s + t) xy
Errors in floating-point division: again, consider positive operands xfp and yfp
xfp /fp yfp = (1 + h) xfp / yfp = (1 + h)(1 + s)x / [(1 + t)y]
= (1 + h)(1 + s)(1 – t)(1 + t²)(1 + t⁴)( . . . ) x/y
≈ (1 + h + s – t) x/y
May 2012 Computer Arithmetic, Real Arithmetic

444 Error Analysis for Addition and Subtraction
Errors in floating-point addition: consider positive operands xfp = (1 + s)x and yfp = (1 + t)y
xfp +fp yfp = (1 + h)(xfp + yfp) = (1 + h)(x + sx + y + ty)
= (1 + h)(1 + (sx + ty)/(x + y))(x + y)
Magnitude of the ratio (sx + ty)/(x + y) is upper-bounded by max(|s|, |t|), so the overall error is no more than |h| + max(|s|, |t|)
Errors in floating-point subtraction: again, consider positive operands xfp and yfp
xfp –fp yfp = (1 + h)(xfp – yfp) = (1 + h)(x + sx – y – ty)
= (1 + h)(1 + (sx – ty)/(x – y))(x – y)
Magnitude of the ratio (sx – ty)/(x – y) can be very large if x and y are both large but x – y is relatively small (recall that t can be negative); this term is thus unbounded for subtraction
May 2012 Computer Arithmetic, Real Arithmetic

445 Cancellation Error in Subtraction
Subtraction result: xfp –fp yfp = (1 + h)(1 + (sx – ty)/(x – y))(x – y)
Example 19.2: Decimal FLP system, r = 10, p = 6, no guard digit
x = 0.100 000 000 × 10³    y = –0.999 999 456 × 10²
xfp = .100 000 × 10³    yfp = –.999 999 × 10²
x + y = 0.544 × 10^–4 and xfp + yfp = 0.1 × 10^–3
xfp +fp yfp = .100 000 × 10³ –fp .099 999 × 10³ = .100 000 × 10^–2
Relative error = (10^–3 – 0.544 × 10^–4) / (0.544 × 10^–4) ≈ 17.38 = 1738%
Now, ignore representation errors, so as to focus on the effect of h (measure relative error with respect to xfp + yfp, not x + y):
Relative error = (10^–3 – 10^–4) / 10^–4 = 9 = 900%
May 2012 Computer Arithmetic, Real Arithmetic

446 Bringing Cancellation Errors in Check
Example 19.2 (cont.): Decimal FLP system, r = 10, p = 6, 1 guard digit
x = 0.100 000 000 × 10³    y = –0.999 999 456 × 10²
xfp = .100 000 × 10³    yfp = –.999 999 × 10²
x + y = 0.544 × 10^–4 and xfp + yfp = 0.1 × 10^–3
xfp +fp yfp = .100 000 0 × 10³ –fp .099 999 9 × 10³ = .100 000 × 10^–3
Relative error = (10^–4 – 0.544 × 10^–4) / (0.544 × 10^–4) ≈ 0.838 = 83.8%
Now, ignore representation errors, so as to focus on the effect of h (measure relative error with respect to xfp + yfp, not x + y):
Relative error = 0, significantly better than 900%!
May 2012 Computer Arithmetic, Real Arithmetic

447 How Many Guard Digits Do We Need?
Theorem 19.1: In the floating-point system FLP(r, p, chop(g)) with g ≥ 1 and –x < y < 0 < x, we have:
xfp +fp yfp = (1 + h)(xfp + yfp) with –r^(–p+1) < h < r^(–p–g+2)
Corollary: In FLP(r, p, chop(1)),
xfp +fp yfp = (1 + h)(xfp + yfp) with |h| < r^(–p+1)
So, a single guard digit is sufficient to make the relative arithmetic error in floating-point addition or subtraction comparable to the relative representation error with truncation
May 2012 Computer Arithmetic, Real Arithmetic

448 19.2 Invalidated Laws of Algebra
Many laws of algebra do not hold for floating-point arithmetic (some don't even hold approximately)
This can be a source of confusion and incompatibility
Associative law of addition: a + (b + c) = (a + b) + c
a = 0.123 41 × 10⁵    b = –0.123 40 × 10⁵    c = 0.143 21 × 10¹
a +fp (b +fp c)
= 0.123 41 × 10⁵ +fp (–0.123 40 × 10⁵ +fp 0.143 21 × 10¹)
= 0.123 41 × 10⁵ –fp 0.123 38 × 10⁵ = 0.300 00 × 10¹
(a +fp b) +fp c
= (0.123 41 × 10⁵ –fp 0.123 40 × 10⁵) +fp 0.143 21 × 10¹
= 0.100 00 × 10¹ +fp 0.143 21 × 10¹ = 0.243 21 × 10¹
Results differ by more than 20%!
May 2012 Computer Arithmetic, Real Arithmetic

449 Elaboration on the Non-Associativity of Addition
Associative law of addition: a + (b + c) = (a + b) + c
a = 0.123 41 × 10⁵    b = –0.123 40 × 10⁵    c = 0.143 21 × 10¹
When we first compute s1 = b + c, the small value of c barely makes a dent, yielding a value for a + s1 that is not much affected by c
When we first compute s2 = a + b, the result will be nearly 0, making the effect of c on the final sum s2 + c more pronounced
May 2012 Computer Arithmetic, Real Arithmetic

450 Do Guard Digits Help with Laws of Algebra?
Invalidated laws of algebra are intrinsic to FLP arithmetic; problems are reduced, but don't disappear, with guard digits
Let's redo our example with 2 guard digits
Associative law of addition: a + (b + c) = (a + b) + c
a = 0.123 41 × 10⁵    b = –0.123 40 × 10⁵    c = 0.143 21 × 10¹
With the guard digits carried in the intermediate sums, the two evaluation orders now yield results that differ by about 0.1%: better, but still too high!
May 2012 Computer Arithmetic, Real Arithmetic

451 Unnormalized Floating-Point Arithmetic
One way to reduce problems resulting from invalidated laws of algebra is to avoid normalizing computed floating-point results
Let's redo our example with unnormalized arithmetic
Associative law of addition: a + (b + c) = (a + b) + c
a = 0.123 41 × 10⁵    b = –0.123 40 × 10⁵    c = 0.143 21 × 10¹
With unnormalized arithmetic, both evaluation orders keep the sum as a small unnormalized value scaled by 10⁵
The results are the same and also carry a kind of warning about the precision lost to cancellation
May 2012 Computer Arithmetic, Real Arithmetic

452 Other Invalidated Laws of Algebra with FLP Arithmetic
Associative law of multiplication: a × (b × c) = (a × b) × c
Cancellation law (for a > 0): a × b = a × c implies b = c
Distributive law: a × (b + c) = (a × b) + (a × c)
Multiplication canceling division: a × (b / a) = b
Before the IEEE 754 floating-point standard became available and widely adopted, these problems were exacerbated by the use of many incompatible formats
May 2012 Computer Arithmetic, Real Arithmetic

453 Effects of Algorithms on Result Precision
Example 19.3: The formula x = –b ± d, with d = (b² – c)^1/2, yielding the roots of the quadratic equation x² + 2bx + c = 0, can be rewritten as x = –c / (b ± d)
When c is small compared with b², the root –b + d will have a large error due to cancellation; in such a case, use –c / (b + d) for that root
Confirmation that –b + d = –c / (b + d): note that –c = d² – b²
Example 19.4: The area of a triangle with sides a, b, and c (assume a ≥ b ≥ c) is given by the formula A = [s(s – a)(s – b)(s – c)]^1/2, where s = (a + b + c)/2. When the triangle is very flat (needlelike), such that a ≈ b + c, Kahan's version returns accurate results:
A = ¼ [(a + (b + c))(c – (a – b))(c + (a – b))(a + (b – c))]^1/2
May 2012 Computer Arithmetic, Real Arithmetic
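A sketch of the cancellation-avoiding root computation of Example 19.3 (assuming b > 0 and real roots, i.e., b² ≥ c):

    import math

    def quad_roots(b, c):                  # roots of x**2 + 2*b*x + c = 0
        d = math.sqrt(b*b - c)
        r1 = -(b + d)                      # like signs: no cancellation
        r2 = c / r1                        # equals -b + d, since -c = d*d - b*b
        return r1, r2

    print(quad_roots(1000.0, 1.0))         # tiny root computed accurately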

454 19.3 Worst-Case Error Accumulation
In a sequence of operations, round-off errors might add up
The larger the number of cascaded computation steps (that depend on results from previous steps), the greater the chance for, and the magnitude of, accumulated errors
With rounding, errors of opposite signs tend to cancel each other out in the long run, but one cannot count on such cancellations
Practical implications:
Perform intermediate computations with a higher precision than what is required in the final result
Implement multiply-accumulate in hardware (DSP chips)
Reduce the number of cascaded arithmetic operations; using computationally more efficient algorithms thus has the double benefit of reducing the execution time as well as the accumulated errors
May 2012 Computer Arithmetic, Real Arithmetic

455 Example: Inner-Product Calculation
Consider the computation z = Σ x(i) y(i), for i ∈ [0, 1023]
Max error per multiply-add step = ulp/2 + ulp/2 = ulp
Total worst-case absolute error = 1024 ulp (equivalent to losing 10 bits of precision)
A possible cure: keep the double-width products in their entirety and add them to compute a double-width result, which is rounded to single width at the very last step
Multiplications do not introduce any round-off error
Max error per addition = ulp²/2
Total worst-case error = 1024 × ulp²/2 + ulp/2
Therefore, provided that overflow is not a problem, a highly accurate result is obtained
May 2012 Computer Arithmetic, Real Arithmetic

456 Kahan’s Summation Algorithm
To compute s = Σ x(i), for i ∈ [0, n – 1], more accurately:
s ← x(0)
c ← 0    {c is a correction term}
for i = 1 to n – 1 do
    y ← x(i) – c    {subtract correction term}
    z ← s + y
    c ← (z – s) – y    {find next correction term}
    s ← z
endfor
May 2012 Computer Arithmetic, Real Arithmetic
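A direct Python transcription of the pseudocode above (assuming binary arithmetic with round-to-nearest, under which the correction-term trick is effective):

    def kahan_sum(x):
        s = x[0]
        c = 0.0                            # correction term
        for xi in x[1:]:
            y = xi - c                     # subtract correction term
            z = s + y
            c = (z - s) - y                # find next correction term
            s = z
        return s

    print(kahan_sum([1e16, 1.0, 1.0]))     # 1.0000000000000002e+16
    print(sum([1e16, 1.0, 1.0]))           # 1e+16 (both unit additions lost)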

457 19.4 Error Distribution and Expected Errors
Probability density function for the distribution of radix-r floating-point significands is 1/(x ln r) Fig Probability density function for the distribution of normalized significands in FLP(r = 2, p, A). May 2012 Computer Arithmetic, Real Arithmetic

458 Maximum Relative Representation Error
MRRE = maximum relative representation error
MRRE(FLP(r, p, chop)) = r^(–p+1)
MRRE(FLP(r, p, round)) = r^(–p+1) / 2
From a practical standpoint, the distribution of errors and their expected values may be more important
Limiting ourselves to positive significands, we define the average relative representation error by integrating |(xfp – x)/x| against the significand distribution, where 1/(x ln r) is a probability density function:
ARRE(FLP(r, p, A)) = integral over [1/r, 1) of |(xfp – x)/x| · dx / (x ln r)
May 2012 Computer Arithmetic, Real Arithmetic

459 19.5 Forward Error Analysis
Consider the computation y = ax + b and its floating-point version:
yfp = (afp ×fp xfp) +fp bfp = (1 + h)y
Can we establish any useful bound on the magnitude of the relative error h, given the relative errors in the input operands afp, bfp, and xfp? The answer is "no"
Forward error analysis = Finding out how far yfp can be from ax + b, or at least from afp xfp + bfp, in the worst case
May 2012 Computer Arithmetic, Real Arithmetic

460 Some Error Analysis Methods
Automatic error analysis:
Run selected test cases with higher precision and observe differences between the new, more precise, results and the original ones
Significance arithmetic:
Roughly speaking, same as unnormalized arithmetic, although there are fine distinctions. The result of an unnormalized decimal addition, such as the one in our earlier example, warns us about precision loss
Noisy-mode computation:
Random digits, rather than 0s, are inserted during normalizing left shifts
If several runs of the computation in noisy mode yield comparable results, then we are probably safe
Interval arithmetic:
An interval [xlo, xhi] represents x, with xlo ≤ x ≤ xhi. With xlo, xhi, ylo, yhi > 0, to find z = x / y, we compute [zlo, zhi] = [xlo /fp yhi, xhi /fp ylo]
Drawback: Intervals tend to widen after many computation steps
May 2012 Computer Arithmetic, Real Arithmetic

461 19.6 Backward Error Analysis
Backward error analysis replaces the original question
"How much does yfp = afp ×fp xfp +fp bfp deviate from y?"
with another question:
"What input changes produce the same deviation?"
In other words, if the exact identity yfp = aalt xalt + balt holds for alternate parameter values aalt, balt, and xalt, we ask how far aalt, balt, and xalt can be from afp, bfp, and xfp
Thus, computation errors are converted to, or compared with, additional input errors
May 2012 Computer Arithmetic, Real Arithmetic

462 Example of Backward Error Analysis
yfp = afp ×fp xfp +fp bfp
= (1 + m)[afp ×fp xfp + bfp]    with |m| < r^(–p+1) = r·ulp
= (1 + m)[(1 + n) afp xfp + bfp]    with |n| < r^(–p+1) = r·ulp
= (1 + m) afp (1 + n) xfp + (1 + m) bfp
= (1 + m)(1 + s)a (1 + n)(1 + d)x + (1 + m)(1 + g)b
≈ (1 + s + m)a (1 + d + n)x + (1 + g + m)b
So the approximate solution of the original problem is the exact solution of a problem close to the original one
The analysis assures us that the effect of arithmetic errors on the result yfp is no more severe than that of r·ulp of additional error in each of the inputs a, b, and x
May 2012 Computer Arithmetic, Real Arithmetic

463 20 Precise and Certifiable Arithmetic
Chapter Goals Discuss methods for doing arithmetic when results of high accuracy or guaranteed correctness are required Chapter Highlights More precise computation through multi- or variable-precision arithmetic Result certification by means of exact or error-bounded arithmetic Precise / exact arithmetic with low overhead May 2012 Computer Arithmetic, Real Arithmetic

464 Precise and Certifiable Arithmetic: Topics
Topics in This Chapter 20.1 High Precision and Certifiability 20.2 Exact Arithmetic 20.3 Multiprecision Arithmetic 20.4 Variable-Precision Arithmetic 20.5 Error-Bounding via Interval Arithmetic 20.6 Adaptive and Lazy Arithmetic May 2012 Computer Arithmetic, Real Arithmetic

465 20.1 High Precision and Certifiability
There are two aspects of precision to discuss:
Results possessing adequate precision
Being able to provide assurance of the same
We consider 3 distinct approaches for coping with precision issues:
1. Obtaining completely trustworthy results via exact arithmetic
2. Making the arithmetic highly precise to raise our confidence in the validity of the results: multi- or variable-precision arithmetic
3. Doing ordinary or high-precision calculations, while tracking potential error accumulation (can lead to fail-safe operation)
We take the hardware to be completely trustworthy; hardware reliability issues are dealt with in Chapter 27
May 2012 Computer Arithmetic, Real Arithmetic

466 Computer Arithmetic, Real Arithmetic
20.2 Exact Arithmetic
Continued fractions: any unsigned rational number x = p/q has a unique continued-fraction expansion with a0 ≥ 0, am ≥ 2, and ai ≥ 1 for 1 ≤ i ≤ m – 1
Example: 277/642 = 1/(2 + 1/(3 + 1/(6 + 1/(1 + 1/(3 + 1/3))))), i.e., the continued fraction [0; 2, 3, 6, 1, 3, 3]
Can get approximations for finite representation by limiting the number of "digits" in the continued-fraction representation
May 2012 Computer Arithmetic, Real Arithmetic
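The expansion is produced by the Euclidean algorithm; this small sketch reproduces the 277/642 example:

    def cf_digits(p, q):
        digits = []
        while q:
            digits.append(p // q)
            p, q = q, p % q                # same step as Euclid's gcd algorithm
        return digits

    print(cf_digits(277, 642))             # [0, 2, 3, 6, 1, 3, 3]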

467 Fixed-Slash Number Systems
The fixed-slash format represents p / q:
Rational number, "rounded" to the nearest representable value, if p > 0 and q > 0
±0 if p = 0 and q odd
±∞ if p odd and q = 0
NaN (not a number) otherwise
Fig. Example fixed-slash number representation format.
Waste due to multiple representations such as 3/5 = 6/10 = 9/15 = ... is no more than one bit, because:
lim (n→∞) |{p/q : 1 ≤ p, q ≤ n, gcd(p, q) = 1}| / n² = 6/π² ≈ 0.608
May 2012 Computer Arithmetic, Real Arithmetic

468 Floating-Slash Number Systems
The floating-slash format represents p / q
Fig. Example floating-slash representation format.
Set of numbers represented:
{p/q : p, q ≥ 1, gcd(p, q) = 1, log2 p + log2 q ≤ k – 2}
Again, the following mathematical result, due to Dirichlet, shows that the space waste is no more than one bit:
lim (n→∞) |{p/q : pq ≤ n, gcd(p, q) = 1}| / |{p/q : pq ≤ n, p, q ≥ 1}| = 6/π² ≈ 0.608
May 2012 Computer Arithmetic, Real Arithmetic

469 20.3 Multiprecision Arithmetic
Fig Example quadruple-precision integer format. Fig Example quadruple-precision floating-point format. May 2012 Computer Arithmetic, Real Arithmetic

470 Multiprecision Floating-Point Addition
Fig Quadruple-precision significands aligned for the floating-point addition z = x +fp y. May 2012 Computer Arithmetic, Real Arithmetic

471 Quad-Precision Arithmetic Using Two Doubles
A quad-precision value x can be split into two doubles xH and xL, with x = xH + xL: xH carries the high-order part of the significand and xL the low-order part, scaled far below xH
Key idea used: One can obtain an accurate sum for two floating-point numbers by computing their regular sum s = x +fp y and an error term e = y – (s – x)
Downloadable software packages for double-double and quad-double arithmetic are available online
May 2012 Computer Arithmetic, Real Arithmetic
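The error-term idea is the classic "fast two-sum"; a sketch (valid when |x| ≥ |y| and arithmetic rounds to nearest):

    def fast_two_sum(x, y):                # assumes |x| >= |y|
        s = x + y                          # regular floating-point sum
        e = y - (s - x)                    # exact rounding error of the sum
        return s, e                        # x + y == s + e holds exactly

    s, e = fast_two_sum(2.0**60, 3.0)
    print(s, e)                            # 1.152921504606847e+18 3.0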

472 20.4 Variable-Precision Arithmetic
Fig Example variable-precision integer format. Fig Example variable-precision floating-point format. May 2012 Computer Arithmetic, Real Arithmetic

473 Variable-Precision Floating-Point Addition
Fig Variable-precision floating-point addition. May 2012 Computer Arithmetic, Real Arithmetic

474 20.5 Error-Bounding via Interval Arithmetic
Interval definition:
[a, b], a ≤ b, is an interval enclosing x, a ≤ x ≤ b (intervals model uncertainty in real-valued parameters)
[a, a] represents the real number x = a
[a, b], a > b, is the empty interval
Combining and comparing intervals:
[xlo, xhi] ∩ [ylo, yhi] = [max(xlo, ylo), min(xhi, yhi)]
[xlo, xhi] ∪ [ylo, yhi] = [min(xlo, ylo), max(xhi, yhi)]
[xlo, xhi] ⊆ [ylo, yhi] iff ylo ≤ xlo and xhi ≤ yhi
[xlo, xhi] = [ylo, yhi] iff xlo = ylo and xhi = yhi
[xlo, xhi] < [ylo, yhi] iff xhi < ylo
May 2012 Computer Arithmetic, Real Arithmetic

475 Arithmetic Operations on Intervals
Additive and multiplicative inverses:
–[xlo, xhi] = [–xhi, –xlo]
1 / [xlo, xhi] = [1/xhi, 1/xlo], provided that 0 ∉ [xlo, xhi]
When 0 ∈ [xlo, xhi], the multiplicative inverse is [–∞, +∞]
The four basic arithmetic operations:
[xlo, xhi] + [ylo, yhi] = [xlo + ylo, xhi + yhi]
[xlo, xhi] – [ylo, yhi] = [xlo – yhi, xhi – ylo]
[xlo, xhi] × [ylo, yhi] = [min(xlo ylo, xlo yhi, xhi ylo, xhi yhi), max(xlo ylo, xlo yhi, xhi ylo, xhi yhi)]
[xlo, xhi] / [ylo, yhi] = [xlo, xhi] × [1/yhi, 1/ylo]
May 2012 Computer Arithmetic, Real Arithmetic
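A bare-bones Python rendering of these rules, with intervals as (lo, hi) pairs; the outward rounding of endpoints, essential in a real implementation, is omitted in this sketch:

    def iadd(x, y):
        return (x[0] + y[0], x[1] + y[1])

    def isub(x, y):
        return (x[0] - y[1], x[1] - y[0])

    def imul(x, y):
        p = (x[0]*y[0], x[0]*y[1], x[1]*y[0], x[1]*y[1])
        return (min(p), max(p))

    def idiv(x, y):                        # requires 0 not in [y_lo, y_hi]
        return imul(x, (1/y[1], 1/y[0]))

    print(idiv((1.0, 2.0), (4.0, 5.0)))    # (0.2, 0.5)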

476 Getting Narrower Result Intervals
Theorem 20.1: If f(x(1), x(2), ..., x(n)) is a rational expression in the interval variables x(1), x(2), ..., x(n), that is, f is a finite combination of x(1), x(2), ..., x(n) and a finite number of constant intervals by means of interval arithmetic operations, then x(i) ⊆ y(i), i = 1, 2, ..., n, implies:
f(x(1), x(2), ..., x(n)) ⊆ f(y(1), y(2), ..., y(n))
Thus, arbitrarily narrow result intervals can be obtained by simply performing arithmetic with sufficiently high precision
With reasonable assumptions about machine arithmetic, we have:
Theorem 20.2: Consider the execution of an algorithm on real numbers using machine interval arithmetic with outward-directed rounding in FLP(r, p, ·). If the same algorithm is executed using precision q, with q > p, the bounds for both the absolute error and the relative error are reduced by the factor r^(q–p) (the absolute or relative error itself may not be reduced by this factor; the guarantee applies only to the upper bound)
May 2012 Computer Arithmetic, Real Arithmetic

477 A Strategy for Accurate Interval Arithmetic
Theorem 20.2: Consider the execution of an algorithm on real numbers using machine interval arithmetic with outward-directed rounding in FLP(r, p, ·). If the same algorithm is executed using precision q, with q > p, the bounds for both the absolute error and the relative error are reduced by the factor r^(q–p) (the absolute or relative error itself may not be reduced by this factor; the guarantee applies only to the upper bound)
Let wmax be the maximum width of a result interval when interval arithmetic is used with p radix-r digits of precision. If wmax ≤ ε, then we are done. Otherwise, interval calculation with the higher precision q = p + logr wmax – logr ε is guaranteed to yield the desired accuracy.
May 2012 Computer Arithmetic, Real Arithmetic

478 The Interval Newton Method
The interval Newton method applies Newton's iteration x(i+1) = x(i) – f(x(i)) / f′(x(i)) to intervals:
N(I(i)) = c(i) – f(c(i)) / f′(I(i)), where c(i) is a point (e.g., the midpoint) of I(i)
I(i+1) = I(i) ∩ N(I(i))
Fig. Illustration of the interval Newton method for computing 1/d, using f(x) = 1/x – d.
May 2012 Computer Arithmetic, Real Arithmetic

479 Laws of Algebra in Interval Arithmetic
As in FLP arithmetic, laws of algebra may not hold for interval arithmetic
For example, one can readily construct an example where, for intervals x, y, and z, the two expressions x(y + z) and xy + xz yield different interval results, thus demonstrating the violation of the distributive law
Can you find other laws of algebra that may be violated?
May 2012 Computer Arithmetic, Real Arithmetic

480 20.6 Adaptive and Lazy Arithmetic
Need-based incremental precision adjustment can avoid the high-precision calculations dictated by worst-case errors
Lazy evaluation is a powerful paradigm that has been and is being used in many different contexts. For example, in evaluating composite conditionals such as
if cond1 and cond2 then action
evaluation of cond2 may be skipped if cond1 yields "false"
More generally, lazy evaluation means postponing computations or actions until they become unavoidable, skipping those that prove irrelevant
The opposite of lazy evaluation (speculative or aggressive execution) has also been applied extensively
May 2012 Computer Arithmetic, Real Arithmetic

481 Lazy Arithmetic with Redundant Representations
Redundant number representations offer some advantages for lazy arithmetic Because redundant representations support MSD-first arithmetic, it is possible to produce a small number of result digits by using correspondingly less computational effort, until more precision is actually needed May 2012 Computer Arithmetic, Real Arithmetic

482 Part VI Function Evaluation
28. Reconfigurable Arithmetic Appendix: Past, Present, and Future May 2012 Computer Arithmetic, Function Evaluation


484 VI Function Evaluation
Learn hardware algorithms for evaluating useful functions:
Divisionlike square-rooting algorithms
Evaluating sin x, tanh x, ln x, ... by series expansion
Function evaluation via convergence computation
Use of tables: the ultimate in simplicity and flexibility
Topics in This Part
Chapter 21 Square-Rooting Methods
Chapter 22 The CORDIC Algorithms
Chapter 23 Variation in Function Evaluation
Chapter 24 Arithmetic by Table Lookup
May 2012 Computer Arithmetic, Function Evaluation


486 21 Square-Rooting Methods
Chapter Goals Learning algorithms and implementations for both digit-at-a-time and convergence square-rooting Chapter Highlights Square-rooting part of IEEE 754 standard Digit-recurrence (divisionlike) algorithms Convergence or iterative schemes Square-rooting not special case of division May 2012 Computer Arithmetic, Function Evaluation

487 Square-Rooting Methods: Topics
Topics in This Chapter 21.1 The Pencil-and-Paper Algorithm 21.2 Restoring Shift / Subtract Algorithm 21.3 Binary Nonrestoring Algorithm 21.4 High-Radix Square-Rooting 21.5 Square-Rooting by Convergence 21.6 Fast Hardware Square-Rooters May 2012 Computer Arithmetic, Function Evaluation

488 21.1 The Pencil-and-Paper Algorithm
Notation for our discussion of square-rooting algorithms:
z    Radicand    z2k–1 z2k–2 ... z3 z2 z1 z0
q    Square root    qk–1 qk–2 ... q1 q0
s    Remainder, z – q²    sk sk–1 sk–2 ... s1 s0
Remainder range: 0 ≤ s ≤ 2q (s has k + 1 digits)
Justification: s ≥ 2q + 1 would lead to z = q² + s ≥ (q + 1)²
Fig. Binary square-rooting in dot notation.
May 2012 Computer Arithmetic, Function Evaluation

489 Example of Decimal Square-Rooting
Extracting the square root of z = (95 241)ten with the pencil-and-paper algorithm:
The root digits are determined one at a time, q2 = 3, q1 = 0, q0 = 8, giving the partial roots q(1) = 3, q(2) = 30, and q(3) = 308
In the q1 step, the trial divisor is "sixty plus q1"; that is, (10 × 2 × q(1) + q1) × q1 is compared against the partial remainder
Final result: q = (308)ten with remainder s = (377)ten
Check: 308² = 94 864 and 94 864 + 377 = 95 241
Fig. Extracting the square root of a decimal integer using the pencil-and-paper algorithm.
May 2012 Computer Arithmetic, Function Evaluation

490 Square-Rooting as Division with Unknown Divisor
Square-rooting z7 z6 z5 z4 z3 z2 z1 z0 to obtain the root q3 q2 q1 q0 resembles division, except that the "divisor" is the root itself, discovered digit by digit
q3 depends only on z7 z6
Justification: For e ≥ 1, the square of (q3 + e) r³ exceeds q3² r⁶ by at least (2q3 + 1) r⁶, leading to a change in z7 z6
Similarly, q2 depends only on z7 z6 z5 z4, and so on
May 2012 Computer Arithmetic, Function Evaluation

491 Root Digit Selection Rule
The root thus far is denoted by q(i) = (qk–1 qk–2 ... qk–i)ten
Attaching the next digit qk–i–1, the partial root becomes q(i+1) = 10 q(i) + qk–i–1
The square of q(i+1) is 100 (q(i))² + 20 q(i) qk–i–1 + (qk–i–1)²
100 (q(i))² = (10 q(i))² was already subtracted from the partial remainder in previous steps
Must subtract (10 (2 q(i)) + qk–i–1) × qk–i–1 to get the new partial remainder
More generally, in radix r, must subtract (r (2 q(i)) + qk–i–1) × qk–i–1
In radix 2, must subtract (4 q(i) + qk–i–1) × qk–i–1, which is 4 q(i) + 1 for qk–i–1 = 1, and 0 otherwise
Thus, we use (qk–1 qk–2 ... qk–i 0 1)two in a trial subtraction
May 2012 Computer Arithmetic, Function Evaluation

492 Example of Binary Square-Rooting
Extracting the square root of z = (0111 0110)two = (118)ten with the pencil-and-paper algorithm:
Trial subtraction of (01)two? Yes, so q3 = 1, q(1) = 1
Trial subtraction of (101)two? No, so q2 = 0, q(2) = 10
Trial subtraction of (1001)two? Yes, so q1 = 1, q(3) = 101
Trial subtraction of (10101)two? No, so q0 = 0, q(4) = 1010
Final result: q = (1010)two = (10)ten with remainder s = (18)ten
Check: 10² + 18 = 118 = (0111 0110)two
Fig. Extracting the square root of a binary integer using the pencil-and-paper algorithm.
May 2012 Computer Arithmetic, Function Evaluation

493 21.2 Restoring Shift / Subtract Algorithm
Consistent with the IEEE 754 floating-point standard, we formulate our algorithms for a radicand in the range 1 ≤ z < 4 (after a possible 1-bit shift for an odd exponent)
1 ≤ z < 4    Radicand    z1 z0 . z–1 z–2 ... z–l
1 ≤ q < 2    Square root    1 . q–1 q–2 ... q–l
0 ≤ s < 4    Remainder    s1 s0 . s–1 s–2 ... s–l
Binary square-rooting is defined by the recurrence
s(j) = 2 s(j–1) – q–j (2 q(j–1) + 2^–j q–j)    with s(0) = z – 1, q(0) = 1, s(l) = s
where q(j) is the root up to its (–j)th digit; thus q = q(l)
To choose the next root digit q–j ∈ {0, 1}, subtract from 2 s(j–1) the value
2 q(j–1) + 2^–j = (1 q–1 . q–2 ... q–j+1 0 1)two
A negative trial difference means q–j = 0
May 2012 Computer Arithmetic, Function Evaluation
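The recurrence can be exercised in software; this sketch mirrors the restoring algorithm for 1 ≤ z < 4 (the operands are exact binary fractions, so Python floats behave like the fixed-point registers here):

    def restoring_sqrt(z, l=6):
        s, q = z - 1.0, 1.0                # s(0) = z - 1, q(0) = 1
        for j in range(1, l + 1):
            trial = 2*s - (2*q + 2.0**-j)  # subtract 2q(j-1) + 2^-j from 2s(j-1)
            if trial >= 0:                 # root digit q_{-j} = 1
                s, q = trial, q + 2.0**-j
            else:                          # q_{-j} = 0: keep ("restore") 2s(j-1)
                s = 2*s
        return q, s

    print(restoring_sqrt(118/64))          # (1.34375, 2.4375) = (86/64, 156/64)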

494 Finding the Sq. Root of z = 1.110110 via the Restoring Algorithm
Finding the square root of z = 1.110110 (= 118/64) via the restoring algorithm:
s(0) = z – 1 = 0.110110, q(0) = 1 (q0 = 1)
In each step j, the value 2 q(j–1) + 2^–j is subtracted from 2 s(j–1); a negative difference means q–j = 0, and 2 s(j–1) is kept (restored) as s(j)
The trial differences yield q–1 = 0 (restore), q–2 = 1, q–3 = 0 (restore), q–4 = 1, q–5 = 1, q–6 = 0 (restore)
Final: root q = 1.010110 (= 86/64), remainder s = 10.0111 (= 156/64)
An extra iteration gives q–7 = 1, so the root is rounded up
Fig. 21.4 Example of sequential binary square-rooting using the restoring algorithm.
May 2012 Computer Arithmetic, Function Evaluation

495 Hardware for Restoring Square-Rooting
Hardware for restoring square-rooting closely parallels that of a restoring divider, with (l + 2)-bit registers and data paths:
Fig. Shift/subtract sequential restoring divider (for comparison).
Fig. Sequential shift/subtract restoring square-rooter.
May 2012 Computer Arithmetic, Function Evaluation

496 Rounding the Square Root
In fractional square-rooting, the remainder is not needed
To round the result, we can produce an extra digit q–l–1:
Truncate for q–l–1 = 0, round up for q–l–1 = 1
The midway case, q–l–1 = 1 followed by all 0s, is impossible (see the problems in the text)
Example: In Fig. 21.4, we had (1.110110)two = (1.010110)two² + (10.0111)two/64
An extra iteration produces q–7 = 1
So the root is rounded up to q = (1.010111)two = 87/64
The rounded-up value is closer to the root than the truncated version:
Original: 118/64 = (86/64)² + 156/(64)²
Rounded: 118/64 = (87/64)² – 17/(64)²
May 2012 Computer Arithmetic, Function Evaluation

497 21.3 Binary Nonrestoring Algorithm
As in nonrestoring division, nonrestoring square-rooting implies:
Root digits in {–1, 1}
On-the-fly conversion to binary
Possible final correction
The case q–j = 1 (nonnegative partial remainder) is handled as in the restoring algorithm; i.e., it leads to the trial subtraction of
q–j [2 q(j–1) + 2^–j q–j] = 2 q(j–1) + 2^–j
For q–j = –1, we must subtract
q–j [2 q(j–1) + 2^–j q–j] = –[2 q(j–1) – 2^–j]
which is equivalent to adding 2 q(j–1) – 2^–j
Slight complication, compared with nonrestoring division: this term cannot be formed by concatenation
May 2012 Computer Arithmetic, Function Evaluation

498 Finding the Sq. Root of z = 1.110110 via the Nonrestoring Algorithm
Finding the square root of z = 1.110110 (= 118/64) via the nonrestoring algorithm:
s(0) = z – 1 = 0.110110, q(0) = 1 (q0 = 1)
A nonnegative partial remainder gives q–j = 1 and a subtraction; a negative one gives q–j = –1 and an addition
The digit choices are q–1 = 1, q–2 = –1, q–3 = 1, q–4 = –1, q–5 = 1, q–6 = 1
The final remainder is negative (–17/64), so a correction step adds 2q – 2^–6, yielding the corrected remainder 156/64 (i.e., 156/64² after scaling)
q = (1.010111)two = 87/64 in binary before correction; the corrected root is (1.010110)two = 86/64
Fig. Example of nonrestoring binary square-rooting.
May 2012 Computer Arithmetic, Function Evaluation

499 Some Details for Nonrestoring Square-Rooting
Depending on the sign of the partial remainder, we must:
(positive) Subtract 2 q(j–1) + 2^–j, formed by concatenating 01 to the end of q(j–1)
(negative) Add 2 q(j–1) – 2^–j, which cannot be formed by concatenation
Solution: We keep q(j–1) and q(j–1) – 2^–j+1 in registers Q (partial root) and Q* (diminished partial root), respectively. Then:
q–j = 1: Subtract 2 q(j–1) + 2^–j, formed by shifting Q 01
q–j = –1: Add 2 q(j–1) – 2^–j, formed by shifting Q* 11
Updating rules for the Q and Q* registers:
q–j = 1 ⇒ Q := Q 1, Q* := Q 0
q–j = –1 ⇒ Q := Q* 1, Q* := Q* 0
Additional rule for SRT-like algorithms that allow q–j = 0 as well:
q–j = 0 ⇒ Q := Q 0, Q* := Q* 1
May 2012 Computer Arithmetic, Function Evaluation

500 21.4 High-Radix Square-Rooting
Basic recurrence for fractional radix-r square-rooting:
s(j) = r s(j–1) – q–j (2 q(j–1) + r^–j q–j)
As in the radix-2 nonrestoring algorithm, we can use two registers Q and Q* to hold q(j–1) and its diminished version q(j–1) – r^–j+1, respectively, suitably updating them in each step
Fig. Radix-4 square-rooting in dot notation.
May 2012 Computer Arithmetic, Function Evaluation

501 An Implementation of Radix-4 Square-Rooting
s(j) = r s(j–1) – q–j (2 q(j–1) + r^–j q–j)    with r = 4, root digit set [–2, 2]
Q* holds q(j–1) – 4^–j+1 = q(j–1) – 2^–2j+2. Then, one of the following values must be subtracted from, or added to, the shifted partial remainder r s(j–1):
q–j = 2: Subtract 4 q(j–1) + 2^–2j+2, double-shift Q 010
q–j = 1: Subtract 2 q(j–1) + 2^–2j, shift Q 001
q–j = –1: Add 2 q(j–1) – 2^–2j, shift Q* 111
q–j = –2: Add 4 q(j–1) – 2^–2j+2, double-shift Q* 110
Updating rules for the Q and Q* registers:
q–j = 2 ⇒ Q := Q 10, Q* := Q 01
q–j = 1 ⇒ Q := Q 01, Q* := Q 00
q–j = 0 ⇒ Q := Q 00, Q* := Q* 11
q–j = –1 ⇒ Q := Q* 11, Q* := Q* 10
q–j = –2 ⇒ Q := Q* 10, Q* := Q* 01
Note that the root is obtained in binary form (no conversion needed!)
May 2012 Computer Arithmetic, Function Evaluation

502 Keeping the Partial Remainder in Carry-Save Form
As in fast division, root digit selection can be based on a few bits of the shifted partial remainder 4 s(j–1) and of the partial root q(j–1)
This would allow us to keep s in carry-save form
One extra bit of each component of s (sum and carry) must be examined
Can use the same lookup table for quotient digit and root digit selection
To see how, compare the recurrences for radix-4 division and square-rooting:
Division: s(j) = 4 s(j–1) – q–j d
Square-rooting: s(j) = 4 s(j–1) – q–j (2 q(j–1) + 4^–j q–j)
To keep the magnitudes of the partial remainders for division and square-rooting comparable, we can perform radix-4 square-rooting using the digit set {–1, –½, 0, ½, 1}
Can convert from the digit set above to the digit set [–2, 2], or directly to binary, with no extra computation
May 2012 Computer Arithmetic, Function Evaluation

503 21.5 Square-Rooting by Convergence
Newton-Raphson method: choose f(x) = x² – z, with a root at x = √z
x(i+1) = x(i) – f(x(i)) / f′(x(i))    becomes    x(i+1) = 0.5 (x(i) + z / x(i))
Each iteration: division, addition, 1-bit shift
Convergence is quadratic
For 0.5 ≤ z < 1, a good starting approximation is (1 + z)/2
This approximation needs no arithmetic
The error is 0 at z = 1 and has a max of 6.07% at z = 0.5
The hardware approximation method of Schwarz and Flynn, using the tree circuit of a fast multiplier, can provide a much better approximation (e.g., to 16 bits, needing only two iterations for 64 bits of precision)
May 2012 Computer Arithmetic, Function Evaluation
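A short sketch of the iteration with the (1 + z)/2 starting value (for 0.5 ≤ z < 1):

    def newton_sqrt(z, iters=4):
        x = (1 + z) / 2                    # starting approximation, max error 6.07%
        for _ in range(iters):
            x = 0.5 * (x + z / x)          # division, addition, 1-bit shift
        return x

    print(newton_sqrt(0.5678))             # ~0.753525..., digits double per step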

504 Initial Approximation Using Table Lookup
Table lookup can yield a better starting estimate x(0) for √z
For example, with an initial estimate accurate to within 2^–8, three iterations suffice to increase the accuracy of the root to 64 bits
x(i+1) = 0.5 (x(i) + z / x(i))
Example 21.1: Compute the square root of z = (2.4)ten
x(0) read out from table = 1.5    accurate to 10^–1
x(1) = 0.5 (x(0) + 2.4 / x(0)) = 1.550 000 000    accurate to 10^–2
x(2) = 0.5 (x(1) + 2.4 / x(1)) = 1.549 193 548    accurate to 10^–4
x(3) = 0.5 (x(2) + 2.4 / x(2)) = 1.549 193 338    accurate to 10^–8
Check: (1.549 193 338)² = 2.399 999 999 ...
May 2012 Computer Arithmetic, Function Evaluation

505 Convergence Square-Rooting without Division
x(i+1) = 0.5 (x(i) + z / x(i))
Rewrite the square-root recurrence as:
x(i+1) = x(i) + 0.5 (1/x(i)) (z – (x(i))²) = x(i) + 0.5 g(x(i)) (z – (x(i))²)
where g(x(i)) is an approximation to 1/x(i) obtained by a simple circuit or read out from a table
Because of the approximation used in lieu of the exact value of 1/x(i), the convergence rate will be less than quadratic
Alternative: Use the recurrence above, but find the reciprocal iteratively, thus interlacing the two computations
Using the function f(y) = 1/y – x to compute 1/x, we get:
x(i+1) = 0.5 (x(i) + z y(i))
y(i+1) = y(i) (2 – x(i) y(i))
Convergence is less than quadratic but better than linear
3 multiplications, 2 additions, and a 1-bit shift per iteration
May 2012 Computer Arithmetic, Function Evaluation

506 Example for Division-Free Square-Rooting
x(i+1) = 0.5 (x(i) + z y(i))
y(i+1) = y(i) (2 – x(i) y(i))
x converges to √z    y converges to 1/√z
Example 21.2: Compute √1.4, beginning with x(0) = y(0) = 1
x(1) = 0.5 (x(0) + 1.4 y(0)) = 1.200 000    y(1) = y(0) (2 – x(0) y(0)) = 1.000 000
x(2) = 0.5 (x(1) + 1.4 y(1)) = 1.300 000    y(2) = y(1) (2 – x(1) y(1)) = 0.800 000
x(3) = 0.5 (x(2) + 1.4 y(2)) = 1.210 000    y(3) = y(2) (2 – x(2) y(2)) = 0.768 000
x(4) = 0.5 (x(3) + 1.4 y(3)) = 1.142 600    y(4) = y(3) (2 – x(3) y(3)) = 0.822 313
x(5) = 0.5 (x(4) + 1.4 y(4)) = 1.146 919    y(5) = y(4) (2 – x(4) y(4)) = 0.872 001
x(6) = 0.5 (x(5) + 1.4 y(5)) = 1.183 860 ≈ √1.4
Check: (1.183 860)² ≈ 1.401 525
May 2012 Computer Arithmetic, Function Evaluation

507 Another Division-Free Convergence Scheme
Based on computing 1/z, which is then multiplied by z to obtain z The function f(x) = 1/x2 – z has a root at x = 1/z (f (x) = –2/x3) x (i+1) = 0.5 x (i) (3 – z (x (i))2) Quadratic convergence 3 multiplications, 1 addition, and a 1-bit shift per iteration Example 21.3: Compute the square root of z = (.5678)ten x (0) read out from table = x (1) = 0.5x (0) (3 – (x (0))2) = x (2) = 0.5x (1) (3 – (x (1))2) = z  z  x (2) = Cray 2 supercomputer used this method. Initially, instead of x (0), the two values 1.5 x (0) and 0.5(x (0))3 are read out from a table, requiring only 1 multiplication in the first iteration. The value x (1) thus obtained is accurate to within half the machine precision, so only one other iteration is needed (in all, 5 multiplications, 2 additions, 2 shifts) May 2012 Computer Arithmetic, Function Evaluation

508 21.6 Fast Hardware Square-Rooters
Combinational hardware square-rooters serve two purposes:
1. Approximation to start up or speed up convergence methods
2. Replace digit-recurrence or convergence methods altogether
Over 1 ≤ z < 4, the best linear approximation is √z ≈ 17/24 + z/3; simpler choices such as √z ≈ 1 + (z – 1)/2 or, over subranges, √z ≈ 7/8 + z/4 and √z ≈ 1 + z/4 trade accuracy for ease of evaluation; more subranges give a better approximation in each
Fig. Plot of the function √z for 1 ≤ z < 4.
May 2012 Computer Arithmetic, Function Evaluation

509 Nonrestoring Array Square-Rooters
Array square-rooters can be derived from the dot-notation representation in much the same way as array dividers Fig Nonrestoring array square-rooter built of controlled add/subtract cells incorporating full adders (FAs) and XOR gates. May 2012 Computer Arithmetic, Function Evaluation

510 Understanding the Array Square-Rooter Design
Partial root, transferred diagonally from row to row, is appended with: 01 if the last root digit was 1; with 11 if the last root digit was 0 May 2012 Computer Arithmetic, Function Evaluation

511 Nonrestoring Array Square-Rooter in Action
Check: 118/256 = (11/16)² + (–3/256)?    Yes: 121/256 – 3/256 = 118/256
Note that the answer is approximate (to within 1 ulp) due to there being no final correction: the uncorrected root comes out one ulp too large, with a negative remainder
May 2012 Computer Arithmetic, Function Evaluation

512 Digit-at-a-Time Version of the Previous Example
z = 118/256:    s(0) = z
2 s(0) – (2q + 2^–1)  →  s(1),  q–1 = 1,  q = .1
2 s(1) – (2q + 2^–2)  →  s(2),  q–2 = 0,  q = .10
2 s(2) + (2q – 2^–3)  →  s(3),  q–3 = 1,  q = .101
2 s(3) – (2q + 2^–4)  →  s(4),  q–4 = 0,  q = .1010
In this example, z is ¼ of the radicand 1.110110 used in the earlier sequential examples. Subtraction (addition) uses the term 2q + 2^–i (2q – 2^–i).
May 2012 Computer Arithmetic, Function Evaluation

513 Square Rooting Is Not a Special Case of Division
A multiplier computing p = a × x, with both inputs connected to the same value x, becomes a squarer computing x²
But direct realization of a squarer leads to a simpler and faster circuit
A divider computing q = z / d cannot be used as a square-rooter by feeding its output back as the divisor (which would force q = z / q, i.e., q = z^1/2): such a feedback connection does not work
Moreover, direct realization of a square-rooter does not lead to a simpler or faster circuit than a divider
May 2012 Computer Arithmetic, Function Evaluation

514 Computer Arithmetic, Function Evaluation
22 The CORDIC Algorithms Chapter Goals Learning a useful convergence method for evaluating trigonometric and other functions Chapter Highlights Basic CORDIC idea: rotate a vector with end point at (x,y) = (1,0) by the angle z to put its end point at (cos z, sin z) Other functions evaluated similarly Complexity comparable to division May 2012 Computer Arithmetic, Function Evaluation

515 The CORDIC Algorithms: Topics
Topics in This Chapter 22.1 Rotations and Pseudorotations 22.2 Basic CORDIC Iterations 22.3 CORDIC Hardware 22.4 Generalized CORDIC 22.5 Using the CORDIC Method 22.6 An Algebraic Formulation May 2012 Computer Arithmetic, Function Evaluation

516 22.1 Rotations and Pseudorotations
Evaluation of trigonometric, hyperbolic, and other common functions, such as log and exp, is needed in many computations
It comes as a surprise to most people that such elementary functions can be evaluated in a time comparable to division time, or a fairly small multiple of it
Some groups advocate including these functions in IEEE 754, thus requiring that they be evaluated exactly, except for the final rounding
Progress has been made toward such properly rounded elementary functions, but the cost of achieving this goal is still prohibitive
CORDIC is a low-cost method that achieves a reasonable accuracy of about 1 ulp, but does not guarantee proper rounding
May 2012 Computer Arithmetic, Function Evaluation

517 Key Ideas on which CORDIC Is Based
COordinate Rotation DIgital Computer: the CORDIC method was used in the 1950s; modern electronic calculators also use it
If we have a computationally efficient way of rotating a vector, we can evaluate the cos, sin, and tan^–1 functions
Rotation by an arbitrary angle is difficult, so we:
Perform pseudorotations that require simpler operations
Use special angles to synthesize the desired angle z:
z = a(1) + a(2) + ... + a(m)
May 2012 Computer Arithmetic, Function Evaluation

518 Rotating a Vector (x (i), y (i)) by the Angle a (i)
x(i+1) = x(i) cos a(i) – y(i) sin a(i) = (x(i) – y(i) tan a(i)) / (1 + tan² a(i))^1/2
y(i+1) = y(i) cos a(i) + x(i) sin a(i) = (y(i) + x(i) tan a(i)) / (1 + tan² a(i))^1/2
z(i+1) = z(i) – a(i)
Recall that cos θ = 1 / (1 + tan² θ)^1/2
Our strategy: Eliminate the terms (1 + tan² a(i))^1/2 and choose the angles a(i) so that tan a(i) is a power of 2; each step then needs just two shift-adds
Fig. A pseudorotation step in CORDIC.
May 2012 Computer Arithmetic, Function Evaluation

519 Pseudorotating a Vector (x (i), y (i)) by the Angle a (i)
x(i+1) = x(i) – y(i) tan a(i)
y(i+1) = y(i) + x(i) tan a(i)
z(i+1) = z(i) – a(i)
Pseudorotation: Whereas a real rotation does not change the length R(i) of the vector, a pseudorotation step increases its length to:
R(i+1) = R(i) / cos a(i) = R(i) (1 + tan² a(i))^1/2
Fig. A pseudorotation step in CORDIC.
May 2012 Computer Arithmetic, Function Evaluation

520 A Sequence of Rotations or Pseudorotations
x(m) = x cos(a(i)) – y sin(a(i)) y(m) = y cos(a(i)) + x sin(a(i)) z(m) = z – (a(i)) After m real rotations by a(1), a(2) , , a(m) , given x(0) = x, y(0) = y, and z(0) = z x(m) = K(x cos(a(i)) – y sin(a(i))) y(m) = K(y cos(a(i)) + x sin(a(i))) z(m) = z – (a(i)) where K = (1 + tan2 a(i))1/2 is a constant if angles of rotation are always the same, differing only in sign or direction After m pseudorotations by a(1), a(2) , , a(m) , given x(0) = x, y(0) = y, and z(0) = z a(1) a(2) a(3) Question: Can we find a set of angles so that any angle can be synthesized from all of them with appropriate signs? May 2012 Computer Arithmetic, Function Evaluation

521 22.2 Basic CORDIC Iterations
CORDIC iteration: In step i, we pseudorotate by an angle whose tangent is di 2^–i (the angle e(i) is fixed; only the direction di is to be picked):
x(i+1) = x(i) – di y(i) 2^–i
y(i+1) = y(i) + di x(i) 2^–i
z(i+1) = z(i) – di tan^–1 2^–i = z(i) – di e(i)
Table 22.1 Value of the function e(i) = tan^–1 2^–i, in degrees and radians, for 0 ≤ i ≤ 9
i    e(i) in degrees (approximate)    e(i) in radians (precise)
0    45.0    0.785 40
1    26.6    0.463 65
2    14.0    0.244 98
3    7.1    0.124 35
4    3.6    0.062 42
5    1.8    0.031 24
6    0.9    0.015 62
7    0.4    0.007 81
8    0.2    0.003 91
9    0.1    0.001 95
Example: 30° angle
30.0 ≈ 45.0 – 26.6 + 14.0 – 7.1 + 3.6 + 1.8 – 0.9 + 0.4 – 0.2 + 0.1 = 30.1
May 2012 Computer Arithmetic, Function Evaluation

522 Choosing the Angles to Force z to Zero
x(i+1) = x(i) – di y(i) 2^–i
y(i+1) = y(i) + di x(i) 2^–i
z(i+1) = z(i) – di tan^–1 2^–i = z(i) – di e(i)
Table 22.2 Choosing the signs of the rotation angles in order to force z to 0
i    z(i) – di e(i) = z(i+1)
0    +30.0 – 45.0 = –15.0
1    –15.0 + 26.6 = +11.6
2    +11.6 – 14.0 = –2.4
3    –2.4 + 7.1 = +4.7
4    +4.7 – 3.6 = +1.1
5    +1.1 – 1.8 = –0.7
6    –0.7 + 0.9 = +0.2
7    +0.2 – 0.4 = –0.2
8    –0.2 + 0.2 = +0.0
9    +0.0 – 0.1 = –0.1
Fig. The first three of 10 pseudorotations leading from (x(0), y(0)) to (x(10), 0) in rotating by +30°.
May 2012 Computer Arithmetic, Function Evaluation

523 Why Any Angle Can Be Formed from Our List
Analogy: Paying a certain amount while using all currency denominations (in positive or negative direction) exactly once; some of the listed denominations are fictitious:
$20 $10 $5 $3 $2 $1 $.50 $.25 $.20 $.10 $.05 $.03 $.02 $.01
Example: Pay $12.50
$20 – $10 + $5 – $3 + $2 – $1 – $.50 + $.25 – $.20 – $.10 + $.05 + $.03 – $.02 – $.01
Convergence is possible as long as each denomination is no greater than the sum of all denominations that follow it. Domain of convergence: –$42.16 to +$42.16
We can guarantee convergence with actual denominations if we allow multiple steps at some values:
$20 $10 $5 $2 $2 $1 $.50 $.25 $.10 $.10 $.05 $.01 $.01 $.01 $.01
$20 – $10 + $5 – $2 – $2 + $1 + $.50 + $.25 – $.10 – $.10 – $.05 + $.01 – $.01 + $.01 – $.01
We will see later that in hyperbolic CORDIC, convergence is guaranteed only if certain "angles" are used twice.
May 2012 Computer Arithmetic, Function Evaluation

524 Computer Arithmetic, Function Evaluation
Angle Recoding
The selection of angles during pseudorotations can be viewed as recoding the angle in a specific number system
For example, an angle of 30° is recoded as the following digit string, with each digit being 1 or –1:
1 –1 1 –1 1 1 –1 1 –1 1
The money-exchange analogy also lends itself to this recoding view
For example, a payment of $12.50 is recoded as:
$20 $10 $5 $3 $2 $1 $.50 $.25 $.20 $.10 $.05 $.03 $.02 $.01
1  –1  1  –1  1  –1  –1  1  –1  –1  1  1  –1  –1
May 2012 Computer Arithmetic, Function Evaluation

525 Using CORDIC in Rotation Mode
x(i+1) = x(i) – di y(i) 2^–i
y(i+1) = y(i) + di x(i) 2^–i
z(i+1) = z(i) – di tan^–1 2^–i = z(i) – di e(i)
Make z converge to 0 by choosing di = sign(z(i)); then:
x(m) = K (x cos z – y sin z)
y(m) = K (y cos z + x sin z)
z(m) = 0
where K = 1.646 760 258 ...
Start with x = 1/K = 0.607 252 935 ... and y = 0 to find cos z and sin z
For k bits of precision in the results, k CORDIC iterations are needed, because tan^–1 2^–i ≈ 2^–i for large i
Convergence of z to 0 is possible because each of the angles in our list is more than half the previous one or, equivalently, each is less than the sum of all the angles that follow it
Domain of convergence is –99.7° ≤ z ≤ 99.7°, where 99.7° is the sum of all the angles in our list; the domain contains [–π/2, π/2] radians
May 2012 Computer Arithmetic, Function Evaluation
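The rotation-mode iterations are easy to simulate; in this sketch the angle table is computed with math.atan (it would be a ROM in hardware), and k = 40 iterations give roughly 40 bits of accuracy:

    import math

    def cordic_sincos(z, k=40):
        K = 1.0
        for i in range(k):
            K *= math.sqrt(1 + 4.0**-i)    # expansion factor, -> 1.6468
        x, y = 1.0 / K, 0.0                # start at (1/K, 0)
        for i in range(k):
            d = 1.0 if z >= 0 else -1.0    # d_i = sign(z(i))
            x, y, z = (x - d*y*2.0**-i,
                       y + d*x*2.0**-i,
                       z - d*math.atan(2.0**-i))
        return x, y                        # (cos, sin) of the original angle

    print(cordic_sincos(math.pi / 6))      # ~(0.866025..., 0.500000...)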

526 Using CORDIC in Vectoring Mode
x(i+1) = x(i) – di y(i) 2^–i
y(i+1) = y(i) + di x(i) 2^–i
z(i+1) = z(i) – di tan^–1 2^–i = z(i) – di e(i)
Make y converge to 0 by choosing di = –sign(x(i) y(i)); then:
x(m) = K (x² + y²)^1/2
y(m) = 0
z(m) = z + tan^–1(y / x)
where K = 1.646 760 258 ...
Start with x = 1 and z = 0 to find tan^–1 y
For k bits of precision in the results, k CORDIC iterations are needed, because tan^–1 2^–i ≈ 2^–i for large i
Even though the computation above always converges, one can use the relationship tan^–1(1/y) = π/2 – tan^–1 y to limit the range of fixed-point numbers encountered
Other trig functions: tan z is obtained from sin z and cos z via division; inverse sine and cosine (sin^–1 z and cos^–1 z) are discussed later
May 2012 Computer Arithmetic, Function Evaluation

527 Computer Arithmetic, Function Evaluation
22.3 CORDIC Hardware
x(i+1) = x(i) – di y(i) 2^–i
y(i+1) = y(i) + di x(i) 2^–i
z(i+1) = z(i) – di tan^–1 2^–i = z(i) – di e(i)
If very high speed is not needed (as in a calculator), a single adder and one shifter would suffice
k table entries are needed for k bits of precision
Fig. Hardware elements needed for the CORDIC method.
May 2012 Computer Arithmetic, Function Evaluation

528 Computer Arithmetic, Function Evaluation
22.4 Generalized CORDIC
x(i+1) = x(i) – m di y(i) 2^–i
y(i+1) = y(i) + di x(i) 2^–i
z(i+1) = z(i) – di e(i)
m = 1: Circular rotations (basic CORDIC), e(i) = tan^–1 2^–i
m = 0: Linear rotations, e(i) = 2^–i
m = –1: Hyperbolic rotations, e(i) = tanh^–1 2^–i
Fig. Circular, linear, and hyperbolic CORDIC.
May 2012 Computer Arithmetic, Function Evaluation

529 22.5 Using the CORDIC Method
x(i+1) = x(i) – m di y(i) 2^–i
y(i+1) = y(i) + di x(i) 2^–i
z(i+1) = z(i) – di e(i)
m ∈ {–1, 0, 1}    di ∈ {–1, 1}
K = 1.646 760 258 ...    1/K = 0.607 252 935 ...
K′ = 0.828 159 360 ...    1/K′ = 1.207 497 067 ...
Fig. Summary of generalized CORDIC algorithms.
May 2012 Computer Arithmetic, Function Evaluation

530 CORDIC Speedup Methods
Skipping some rotations:
Must keep track of the expansion via the recurrence (K(i+1))² = (K(i))² (1 ± 2^–2i)
This additional work makes variable-factor CORDIC less cost-effective than constant-factor CORDIC
Early termination:
Do the first k/2 iterations as usual, then combine the remaining k/2 into a single multiplicative step:
x(k) = x(k/2) – y(k/2) z(k/2)
y(k) = y(k/2) + x(k/2) z(k/2)
z(k) = z(k/2) – z(k/2) = 0
This works because, for very small z, we have tan^–1 z ≈ z ≈ tan z
The expansion factor is not an issue because the contribution of the ignored terms is provably less than ulp
High-radix CORDIC:
di ∈ {–2, –1, 1, 2} or {–2, –1, 0, 1, 2}
The hardware for the radix-4 version of CORDIC is quite similar to that of Fig. 22.3
May 2012 Computer Arithmetic, Function Evaluation

531 22.6 An Algebraic Formulation
Because cos z + j sin z = e^(jz), where j = √–1, cos z and sin z can be computed by evaluating the complex exponential function e^(jz)
This leads to an alternate derivation of the CORDIC iterations
Details in the text
May 2012 Computer Arithmetic, Function Evaluation

532 23 Variations in Function Evaluation
Chapter Goals Learning alternate computation methods (convergence and otherwise) for some functions computable through CORDIC Chapter Highlights Reasons for needing alternate methods: Achieve higher performance or precision Allow speed/cost tradeoffs Optimizations, fit to diverse technologies May 2012 Computer Arithmetic, Function Evaluation

533 Variations in Function Evaluation: Topics
Topics in This Chapter 23.1 Normalization and Range Reduction 23.2 Computing Logarithms 23.3 Exponentiation 23.4 Division and Square-Rooting, Again 23.5 Use of Approximating Functions 23.6 Merged Arithmetic May 2012 Computer Arithmetic, Function Evaluation

534 23.1 Normalization and Range Reduction
u(i+1) = f(u(i), v(i))
v(i+1) = g(u(i), v(i))
or, with three sequences:
u(i+1) = f(u(i), v(i), w(i))
v(i+1) = g(u(i), v(i), w(i))
w(i+1) = h(u(i), v(i), w(i))
Guide the iteration such that one of the values converges to a constant (usually 0 or 1); this is known as normalization
The other value then converges to the desired function
Additive normalization: Normalize u via the addition of terms to it
Multiplicative normalization: Normalize u via the multiplication of terms
Additive normalization is more desirable, unless the multiplicative terms are of the form 1 ± 2^a (shift-add) or multiplication leads to much faster convergence compared with addition
May 2012 Computer Arithmetic, Function Evaluation

535 Convergence Methods You Already Know
CORDIC: example of additive normalization
x(i+1) = x(i) – m di y(i) 2^–i
y(i+1) = y(i) + di x(i) 2^–i
z(i+1) = z(i) – di e(i)
Force y or z to 0 by adding terms to it
Division by repeated multiplications: example of multiplicative normalization
d(i+1) = d(i) (2 – d(i))    Set d(0) = d; iterate until d(m) ≈ 1
z(i+1) = z(i) (2 – d(i))    Set z(0) = z; obtain z/d = q ≈ z(m)
Force d to 1 by multiplying terms with it
May 2012 Computer Arithmetic, Function Evaluation

536 Computer Arithmetic, Function Evaluation
Range Reduction –2 –3/2 – –/2 /2 3/2 2 CORDIC’s conv. domain –99.7 to 99.7 cos(z – p) = –cos z cos(2jp + z) = cos z Adding p to the argument flips the function sign Subtracting multiples of 2p from the argument does not change the function value Must be careful: A slight error in the value of p is amplified when a large multiple of 2p is added to, or subtracted from, the argument Example: Compute cos(1.125  247) Additive range reduction: see the CORDIC example above Multiplicative range reduction: applicable to the log function, e.g. May 2012 Computer Arithmetic, Function Evaluation

537 Computer Arithmetic, Function Evaluation
23.2 Computing Logarithms
x(i+1) = x(i) c(i) = x(i) (1 + di 2^–i)
y(i+1) = y(i) – ln c(i) = y(i) – ln(1 + di 2^–i)    (read out from a table)
with di ∈ {–1, 0, 1}: force x(m) to 1, so that y(m) converges to y + ln x
Why does this multiplicative normalization method work?
x(m) = x Π c(i) ≈ 1    implies    Π c(i) ≈ 1/x
y(m) = y – Σ ln c(i) = y – ln(Π c(i)) = y – ln(1/x) ≈ y + ln x
Convergence domain: Π 1/(1 + 2^–i) ≤ x ≤ Π 1/(1 – 2^–i), or 0.21 ≤ x ≤ 3.45
Number of iterations: k, for k bits of precision; for large i, ln(1 ± 2^–i) ≈ ±2^–i
Use directly for x ∈ [1, 2). For x = 2^q × s with s ∈ [1, 2), we have:
ln x = q ln 2 + ln s, and log2 x = q + log2 s
A radix-4 version can be devised
May 2012 Computer Arithmetic, Function Evaluation

538 Computing Binary Logarithms via Squaring
For x  [1, 2), log2 x is a fractional number y = (. y–1y–2y– y–l)two x = 2y = 2 x 2 = 22y =  y–1 = 1 iff x 2  2 (. y–1y–2y– y–l)two (y–1. y–2y– y–l)two Once y–1 has been determined, if y–1 = 0, we are back at the original situation; otherwise, divide both sides of the equation above by 2 to get: x 2/2 = /2 = 2 (1 . y–2y– y–l)two (. y–2y– y–l)two Fig Hardware elements needed for computing log2 x. Generalization to base b: x = b y–1 = 1 iff x 2  b (. y–1y–2y– y–l)two May 2012 Computer Arithmetic, Function Evaluation

539 Computer Arithmetic, Function Evaluation
23.3 Exponentiation: Computing e^x. di ∈ {–1, 0, 1}. x(i+1) = x(i) – ln c(i) = x(i) – ln(1 + di 2^–i), with ln(1 + di 2^–i) read out from a table; y(i+1) = y(i) c(i) = y(i) (1 + di 2^–i). Force x(m) to 0; then y(m) converges to y e^x. Why does this additive normalization method work? x(m) = x – Σ ln c(i) ≈ 0 ⟹ Σ ln c(i) ≈ x; y(m) = y ∏ c(i) = y ∏ exp(ln c(i)) = y exp(Σ ln c(i)) ≈ y e^x. Convergence domain: Σ ln(1 – 2^–i) ≤ x ≤ Σ ln(1 + 2^–i), or about –1.24 ≤ x ≤ 1.56. Number of iterations: k, for k bits of precision; for large i, ln(1 ± 2^–i) ≈ ±2^–i. Can eliminate half the iterations, because ln(1 + ε) = ε – ε²/2 + ε³/3 – … ≈ ε for ε² < ulp, so we may write y(k) = y(k/2) (1 + x(k/2)). A radix-4 version can be devised. May 2012 Computer Arithmetic, Function Evaluation
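A matching Python sketch (again with a simplified greedy digit selection in place of the hardware rule, and math.log standing in for the table of ln(1 ± 2^–i)):

```python
import math

def exp_by_additive_normalization(x, k=53):
    """e^x by driving x to 0 with subtracted ln(1 + d*2^-i) terms."""
    y = 1.0
    for i in range(1, k + 1):
        step = 2.0 ** -i
        for d in (1, -1):
            t = math.log(1 + d * step)   # table value ln(1 + d*2^-i)
            if abs(x - t) < abs(x):
                x -= t                   # x(i+1) = x(i) - ln(1 + d*2^-i)
                y *= 1 + d * step        # y(i+1) = y(i)(1 + d*2^-i)
                break                    # otherwise d = 0
    return y

print(exp_by_additive_normalization(0.5), math.exp(0.5))  # both ~1.648721
```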

540 General Exponentiation, or Computing xy
x^y = (e^(ln x))^y = e^(y ln x). So, compute natural log, multiply, exponentiate; the method is prone to inaccuracies. When y is an integer, we can exponentiate by repeated multiplication (we need to consider only positive y; for negative y, compute the reciprocal). In particular, when y is a constant, the methods used are reminiscent of multiplication by constants (Section 9.5). Example: x^25 = ((((x)²x)²)²)²x [4 squarings and 2 multiplications]. Noting that 25 = (1 1 0 0 1)two leads to a general procedure. Computing x^y, when y is an unsigned integer: initialize the partial result to 1; scan the binary representation of y, starting at its MSB, and repeat: if the current bit is 1, multiply the partial result by x; if the current bit is 0, do not change the partial result; square the partial result before the next step (if any). May 2012 Computer Arithmetic, Function Evaluation
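The general procedure, as code (left-to-right square-and-multiply; squaring first and then conditionally multiplying is equivalent to the ordering stated above):

```python
def power_by_binary_method(x, y):
    """x^y for an unsigned integer exponent y, scanned MSB first."""
    result = 1
    for bit in bin(y)[2:]:
        result *= result       # square before absorbing the next bit
        if bit == '1':
            result *= x        # multiply when the current bit is 1
    return result

print(power_by_binary_method(3, 25) == 3 ** 25)  # True
```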

541 Faster Exponentiation via Recoding
Example: x^31 = ((((x)²x)²x)²x)²x [4 squarings and 4 multiplications]. Note that 31 = (1 1 1 1 1)two = (1 0 0 0 0 –1)two, so x^31 = (((((x)²)²)²)²)² / x [5 squarings and 1 division]. Computing x^y, when y is an integer encoded in BSD format: initialize the partial result to 1; scan the BSD representation of y, starting at its MSB, and repeat: if the current digit is 1, multiply the partial result by x; if the current digit is 0, do not change the partial result; if the current digit is –1, divide the partial result by x; square the partial result before the next step (if any). Radix-4 example: 31 = (0 1 1 1 1 1)two = (1 0 0 0 0 –1)two = (2 0 –1)four, so x^31 = (((x²)⁴)⁴)/x. [Can you formulate the general procedure?] May 2012 Computer Arithmetic, Function Evaluation
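A sketch of the BSD variant (exact rational arithmetic keeps the divisions exact; digits are listed MSB first, with –1 playing the role of the negative digit):

```python
from fractions import Fraction

def power_bsd(x, digits):
    """x^y for a binary signed-digit exponent, digits in {-1, 0, 1}."""
    result = Fraction(1)
    x = Fraction(x)
    for d in digits:
        result *= result       # square before absorbing each digit
        if d == 1:
            result *= x
        elif d == -1:
            result /= x        # a -1 digit costs a division
    return result

# 31 = (1 0 0 0 0 -1)two: 5 squarings and 1 division
print(power_bsd(3, [1, 0, 0, 0, 0, -1]) == 3 ** 31)  # True
```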

542 23.4 Division and Square-Rooting, Again
Computing q = z/d: s(i+1) = s(i) – γ(i) d; q(i+1) = q(i) + γ(i). In digit-recurrence division, γ(i) is the next quotient digit and the addition for q turns into concatenation; more generally, γ(i) can be any estimate for the difference between the partial quotient q(i) and the final quotient q. Because s(i) becomes successively smaller as it converges to 0, scaled versions of the recurrences above are usually preferred. In the following, s(i) stands for s(i) r^i and q(i) for q(i) r^i: s(i+1) = r s(i) – γ(i) d, set s(0) = z and keep s(i) bounded; q(i+1) = r q(i) + γ(i), set q(0) = 0 and find q* = q(m) r^–m. In the scaled version, γ(i) is an estimate for r(r^(i–m) q – q(i)) = r(r^i q* – q(i)), where q* = r^–m q represents the true quotient. May 2012 Computer Arithmetic, Function Evaluation

543 Square-Rooting via Multiplicative Normalization
Idea: if z is multiplied by a sequence of values (c(i))², chosen so that the product z ∏(c(i))² converges to 1, then z ∏ c(i) converges to √z. x(i+1) = x(i) (1 + di 2^–i)² = x(i) (1 + 2di 2^–i + di² 2^–2i), with x(0) = z and x(m) ≈ 1; y(i+1) = y(i) (1 + di 2^–i), with y(0) = z and y(m) ≈ √z. What remains is to devise a scheme for choosing the di values in {–1, 0, 1}: di = 1 for x(i) < 1 – ε = 1 – α 2^–i; di = –1 for x(i) > 1 + ε = 1 + α 2^–i. To avoid the need for comparison with a different constant in each step, a scaled version of the first recurrence is used in which u(i) = 2^i (x(i) – 1): u(i+1) = 2(u(i) + 2di) + 2^(–i+1)(2di u(i) + di²) + 2^(–2i+1) di² u(i), with u(0) = z – 1 and u(m) ≈ 0; y(i+1) = y(i) (1 + di 2^–i), with y(0) = z and y(m) ≈ √z. A radix-4 version can be devised: digit set [–2, 2] or {–1, –½, 0, ½, 1}. May 2012 Computer Arithmetic, Function Evaluation
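In Python (with a simplified digit-selection rule, a plain sign test instead of the threshold comparison above):

```python
def sqrt_multiplicative(z, k=40):
    """sqrt(z) by multiplicative normalization: drive x = z*prod(c^2) to 1."""
    x, y = z, z
    for i in range(1, k + 1):
        step = 2.0 ** -i
        d = 1 if x < 1.0 else (-1 if x > 1.0 else 0)
        c = 1 + d * step
        x *= c * c          # x(i+1) = x(i) * (1 + d*2^-i)^2
        y *= c              # y(i+1) = y(i) * (1 + d*2^-i)  ->  sqrt(z)
    return y

print(sqrt_multiplicative(2.0))  # ~1.4142136
```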

544 Square-Rooting via Additive Normalization
Idea: if a sequence of values c(i) can be obtained such that z – (Σ c(i))² converges to 0, then Σ c(i) converges to √z. x(i+1) = z – (y(i+1))² = z – (y(i) + c(i))² = x(i) + 2di y(i) 2^–i – di² 2^–2i, with x(0) = z and x(m) ≈ 0; y(i+1) = y(i) + c(i) = y(i) – di 2^–i, with y(0) = 0 and y(m) ≈ √z. What remains is to devise a scheme for choosing the di values in {–1, 0, 1}: di = 1 for x(i) < –ε = –α 2^–i; di = –1 for x(i) > +ε = +α 2^–i. To avoid the need for comparison with a different constant in each step, a scaled version of the first recurrence may be used in which u(i) = 2^i x(i): u(i+1) = 2(u(i) + 2di y(i) – di² 2^–i), with u(0) = z and u(i) bounded; y(i+1) = y(i) – di 2^–i, with y(0) = 0 and y(m) ≈ √z. A radix-4 version can be devised: digit set [–2, 2] or {–1, –½, 0, ½, 1}. May 2012 Computer Arithmetic, Function Evaluation
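And the additive counterpart (same simplified sign-test selection; note that x(i) always equals z – y(i)², so driving x to 0 drives y to √z):

```python
def sqrt_additive(z, k=40):
    """sqrt(z) by additive normalization: drive x = z - y^2 to 0."""
    x, y = z, 0.0
    for i in range(k):
        step = 2.0 ** -i
        d = -1 if x > 0 else (1 if x < 0 else 0)
        x += 2 * d * y * step - d * d * step * step  # keep x = z - y^2
        y -= d * step                                # y(i+1) = y(i) - d*2^-i
    return y

print(sqrt_additive(2.0))  # ~1.4142136
```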

545 23.5 Use of Approximating Functions
Convert the problem of evaluating the function f to that of evaluating a function g that approximates f, perhaps with a few pre- and postprocessing operations. Approximating polynomials need only additions and multiplications. Polynomial approximations can be derived from various schemes. The Taylor-series expansion of f(x) about x = a is f(x) = Σ_{j=0 to ∞} f^(j)(a) (x – a)^j / j!. The error due to omitting terms of degree > m is: f^(m+1)(a + μ(x – a)) (x – a)^(m+1) / (m + 1)!, 0 < μ < 1. Setting a = 0 yields the Maclaurin-series expansion f(x) = Σ_{j=0 to ∞} f^(j)(0) x^j / j! and its corresponding error bound: f^(m+1)(μx) x^(m+1) / (m + 1)!, 0 < μ < 1. Efficiency in computation can be gained via Horner's method and incremental evaluation. May 2012 Computer Arithmetic, Function Evaluation
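As a concrete instance, the Maclaurin series of e^x evaluated with Horner's method (m multiplications and m additions for a degree-m polynomial):

```python
from math import factorial

def exp_maclaurin_horner(x, m=12):
    """Degree-m Maclaurin polynomial of e^x, evaluated by Horner's rule."""
    result = 1.0 / factorial(m)
    for j in range(m - 1, -1, -1):
        result = result * x + 1.0 / factorial(j)   # Horner step
    return result

print(exp_maclaurin_horner(0.5))  # ~1.6487213; error ~1e-14 for m = 12
```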

546 Some Polynomial Approximations (Table 23.1)
Func: Polynomial approximation; Conditions
1/x: 1 + y + y² + y³ + … + y^i + …; 0 < x < 2, y = 1 – x
e^x: 1 + x/1! + x²/2! + x³/3! + … + x^i/i! + …
ln x: –y – y²/2 – y³/3 – y⁴/4 – … – y^i/i – …; 0 < x ≤ 2, y = 1 – x
ln x: 2[z + z³/3 + z⁵/5 + … + z^(2i+1)/(2i + 1) + …]; x > 0, z = (x – 1)/(x + 1)
sin x: x – x³/3! + x⁵/5! – x⁷/7! + … + (–1)^i x^(2i+1)/(2i + 1)! + …
cos x: 1 – x²/2! + x⁴/4! – x⁶/6! + … + (–1)^i x^(2i)/(2i)! + …
tan⁻¹ x: x – x³/3 + x⁵/5 – x⁷/7 + … + (–1)^i x^(2i+1)/(2i + 1) + …; –1 < x < 1
sinh x: x + x³/3! + x⁵/5! + x⁷/7! + … + x^(2i+1)/(2i + 1)! + …
cosh x: 1 + x²/2! + x⁴/4! + x⁶/6! + … + x^(2i)/(2i)! + …
tanh⁻¹ x: x + x³/3 + x⁵/5 + x⁷/7 + … + x^(2i+1)/(2i + 1) + …; –1 < x < 1
May 2012 Computer Arithmetic, Function Evaluation

547 Function Evaluation via Divide-and-Conquer
Let x in [0, 4) be the (l + 2)-bit significand of a floating-point number or its shifted version. Divide x into two chunks, xH and xL: x = xH + 2^–t xL, where xH in [0, 4) consists of the upper t + 2 bits and xL in [0, 1) consists of the remaining l – t bits. The Taylor-series expansion of f(x) about x = xH is f(x) = Σ_{j=0 to ∞} f^(j)(xH) (2^–t xL)^j / j!. A linear approximation is obtained by taking only the first two terms: f(x) ≈ f(xH) + 2^–t xL f′(xH). If t is not too large, f and/or f′ (and other derivatives of f, if needed) can be evaluated via table lookup. May 2012 Computer Arithmetic, Function Evaluation

548 Approximation by the Ratio of Two Polynomials
Example, yielding good results for many elementary functions: approximate f(x) by the ratio of two polynomials, f(x) ≈ a(x)/b(x) (for instance, two degree-5 polynomials). Using Horner's method, such a "rational approximation" needs 10 multiplications, 10 additions, and 1 division. May 2012 Computer Arithmetic, Function Evaluation

549 Computer Arithmetic, Function Evaluation
23.6 Merged Arithmetic. Our methods thus far rely on word-level building-block operations such as addition, multiplication, shifting, . . . Sometimes, we can compute a function of interest directly, without breaking it down into conventional operations. Example: merged arithmetic for inner-product computation z = z(0) + x(1) y(1) + x(2) y(2) + x(3) y(3). Fig. Merged-arithmetic computation of an inner product followed by accumulation: the dot matrix combines the partial-product bits of x(1)y(1), x(2)y(2), and x(3)y(3) with the bits of z(0), all reduced together. May 2012 Computer Arithmetic, Function Evaluation

550 Example of Merged Arithmetic Implementation
Example: inner-product computation z = z(0) + x(1) y(1) + x(2) y(2) + x(3) y(3) (Fig. 23.2). Fig. Tabular representation of the dot matrix for inner-product computation and its reduction: successive rows of the matrix are reduced using full adders and half adders, with a final carry-propagate adder producing the result. May 2012 Computer Arithmetic, Function Evaluation

551 Another Merged Arithmetic Example
Approximation of the reciprocal (1/x) and reciprocal square root (1/√x) functions with enough bits of precision that a long floating-point result can be obtained with just one iteration at the end [Pine02]. f(x) = c + bv + av²: 2 adds, 2 multiplications, 1 squaring; comparable in cost to a multiplier. May 2012 Computer Arithmetic, Function Evaluation

552 24 Arithmetic by Table Lookup
Chapter Goals Learning table lookup techniques for flexible and dense VLSI realization of arithmetic functions Chapter Highlights We have used tables to simplify or speed up quotient digit selection, convergence methods, . . . Now come tables as primary computational mechanisms (as stars, not supporting cast) May 2012 Computer Arithmetic, Function Evaluation

553 Arithmetic by Table Lookup: Topics
Topics in This Chapter 24.1 Direct and Indirect Table Lookup 24.2 Binary-to-Unary Reduction 24.3 Tables in Bit-Serial Arithmetic 24.4 Interpolating Memory 24.5 Piecewise Lookup Tables 24.6 Multipartite Table Methods May 2012 Computer Arithmetic, Function Evaluation

554 24.1 Direct and Indirect Table Lookup
Fig Direct table lookup versus table-lookup with pre- and post-processing. May 2012 Computer Arithmetic, Function Evaluation

555 Tables in Supporting and Primary Roles
Tables are used in two ways: In supporting role, as in initial estimate for division As main computing mechanism Boundary between two uses is fuzzy Pure logic Hybrid solutions Pure tabular Previously, we started with the goal of designing logic circuits for particular arithmetic computations and ended up using tables to facilitate or speed up certain steps Here, we aim for a tabular implementation and end up using peripheral logic circuits to reduce the table size Some solutions can be derived starting at either endpoint May 2012 Computer Arithmetic, Function Evaluation

556 24.2 Binary-to-Unary Reduction
Strategy: reduce the table size by using an auxiliary unary function to evaluate a desired binary function. Example 1: addition/subtraction in a logarithmic number system; i.e., finding Lz = log(x ± y), given Lx and Ly. Solution: let Δ = Ly – Lx. Then Lz = log(x ± y) = log(x (1 ± y/x)) = log x + log(1 ± y/x) = Lx + log(1 ± log⁻¹ Δ). Pre-process: compute Δ = Ly – Lx. Table lookup: f+ table or f– table, evaluating f±(Δ) = log(1 ± log⁻¹ Δ). Post-process: Lz = Lx + f±(Δ). May 2012 Computer Arithmetic, Function Evaluation
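A toy software version of this logarithmic-number-system addition (the call to math.log2 stands in for the f+ lookup table):

```python
import math

def lns_add(Lx, Ly):
    """Given Lx = log2(x) and Ly = log2(y), x, y > 0, return log2(x + y)."""
    if Ly > Lx:
        Lx, Ly = Ly, Lx                       # make delta = Ly - Lx <= 0
    delta = Ly - Lx
    return Lx + math.log2(1 + 2.0 ** delta)   # Lx + f+(delta), a unary table

x, y = 3.0, 5.0
print(2 ** lns_add(math.log2(x), math.log2(y)))  # ~8.0
```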

557 Another Example of Binary-to-Unary Reduction
Example 2: multiplication via squaring, xy = (x + y)²/4 – (x – y)²/4. Simplification and implementation details: if x and y are k bits wide, x + y and x – y are k + 1 bits wide, leading to two tables of size 2^(k+1) × 2k (total table size = 2^(k+3) × k bits). (x ± y)/2 = ⌊(x ± y)/2⌋ + ε/2, where ε ∈ {0, 1} is the dropped LSB (x + y and x – y are both even or both odd). (x + y)²/4 – (x – y)²/4 = [⌊(x + y)/2⌋ + ε/2]² – [⌊(x – y)/2⌋ + ε/2]² = ⌊(x + y)/2⌋² – ⌊(x – y)/2⌋² + εy. Pre-process: compute x + y and x – y; drop their LSBs. Table lookup: consult two squaring tables of size 2^k × (2k – 1). Post-process: carry-save adder, followed by carry-propagate adder (table size after simplification = 2^(k+1) × (2k – 1) ≈ 2^(k+2) × k bits). Can be realized with one adder and one table. Fig. 24.2: x, y → pre-process (two adds) → square tables → post-process (add) → xy. May 2012 Computer Arithmetic, Function Evaluation
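The simplified scheme in Python (one squaring table indexed by the halved, LSB-dropped sums; the εy term restores exactness; y ≤ x is assumed so that x – y is nonnegative):

```python
K = 8
SQUARE = [i * i for i in range(1 << K)]   # stand-in for the squaring table

def multiply_via_squaring(x, y):
    """xy = floor((x+y)/2)^2 - floor((x-y)/2)^2 + eps*y, for 0 <= y <= x < 2^K."""
    s, d = x + y, x - y
    eps = s & 1                  # x+y and x-y have the same parity
    return SQUARE[s >> 1] - SQUARE[d >> 1] + eps * y

print(multiply_via_squaring(13, 12))  # 156
```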

558 24.3 Tables in Bit-Serial Arithmetic
Fig. Bit-serial ALU with two tables implemented as multiplexers: operands are specified by 16-bit addresses into a 64-Kb memory; a flag is specified by a 2-bit address; an 8-bit opcode supplies the truth tables of the two functions f and g; 3 bits specify a flag and a value to conditionalize the operation; for addition, f produces the sum bit and g the carry bit, which replaces a flag in memory. Used in Connection Machine 2, an MPP introduced in 1987. May 2012 Computer Arithmetic, Function Evaluation

559 Other Table-Based Bit-Serial Arithmetic Examples
Modular accumulator: the bits of x enter serially (x0, x1, x2, …, xk–1) and the unit accumulates x mod m; see Section 4.3: conversion from binary/decimal to RNS. Evaluation of linear expressions (assume unsigned values): z = ax + by = a Σ xi 2^i + b Σ yi 2^i = Σ (a xi + b yi) 2^i. A 4-entry table holds {0, a, b, a + b}; in each cycle, the bit pair (xi, yi) addresses the table, a carry-save adder adds the entry to the running residual, and the LSB is output as zi. Fig. Bit-serial evaluation of z = ax + by. May 2012 Computer Arithmetic, Function Evaluation
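A bit-serial rendering of z = ax + by (the 4-entry table {0, a, b, a + b} is indexed by the current bit pair; one result bit emerges per cycle from the carry-save residual):

```python
def bit_serial_linear(a, b, x, y, k=8):
    """z = a*x + b*y, consuming one bit of x and y per cycle, LSB first."""
    table = [0, a, b, a + b]          # indexed by the bit pair (yi, xi)
    residual, z = 0, 0
    for i in range(k + (a + b).bit_length() + 1):   # extra cycles flush carries
        xi, yi = (x >> i) & 1, (y >> i) & 1
        residual += table[(yi << 1) | xi]
        z |= (residual & 1) << i      # output bit z_i
        residual >>= 1                # retained carry-save residual
    return z

print(bit_serial_linear(3, 5, 10, 6))  # 3*10 + 5*6 = 60
```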

560 Computer Arithmetic, Function Evaluation
24.4 Interpolating Memory. Linear interpolation: computing f(x), x ∈ [xlo, xhi], from f(xlo) and f(xhi): f(x) = f(xlo) + [(x – xlo)/(xhi – xlo)] [f(xhi) – f(xlo)]: 4 adds/subtracts, 1 divide, 1 multiply. If the xlo and xhi endpoints are consecutive multiples of a power of 2, the division and two of the additions become trivial. Example: evaluating log2 x for x ∈ [1, 2): f(xlo) = log2 1 = 0, f(xhi) = log2 2 = 1; thus log2 x ≈ x – 1 = fractional part of x. An improved linear approximation: log2 x ≈ (x – 1) + (ln 2 – ln(ln 2) – 1)/(2 ln 2) ≈ (x – 1) + 0.043. May 2012 Computer Arithmetic, Function Evaluation
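Numerically (Python), the improved offset halves the maximum error of the plain x – 1 approximation:

```python
import math

OFFSET = (math.log(2) - math.log(math.log(2)) - 1) / (2 * math.log(2))  # ~0.043

def log2_linear(x):           # valid for 1 <= x < 2
    return (x - 1) + OFFSET

worst = max(abs(log2_linear(1 + i / 4096) - math.log2(1 + i / 4096))
            for i in range(4096))
print(OFFSET, worst)          # max error ~0.043, down from ~0.086
```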

561 Hardware Linear Interpolation Scheme
Fig Linear interpolation for computing f(x) and its hardware realization. May 2012 Computer Arithmetic, Function Evaluation

562 Linear Interpolation with Four Subintervals
Fig. Linear interpolation for computing f(x) using 4 subintervals. Table (columns: i, xlo, xhi, a(i), b(i), max error): approximating log2 x for x in [1, 2) using linear interpolation within 4 subintervals. May 2012 Computer Arithmetic, Function Evaluation

563 Tradeoffs in Cost, Speed, and Accuracy
Fig Maximum absolute error in computing log2 x as a function of number h of address bits for the tables with linear, quadratic (second-degree), and cubic (third-degree) interpolations [Noet89]. May 2012 Computer Arithmetic, Function Evaluation

564 Interpolation with Nonuniform Intervals
One way to use interpolation with nonuniform intervals is to successively divide ranges and subranges of interest into 2 parts, with finer divisions used where the function exhibits greater curvature (nonlinearity). In this way, a number of leading bits can be used to decide which subrange applies. Example: the [0, 1) range divided into 4 nonuniform intervals .0xx, .10x, .110, .111. May 2012 Computer Arithmetic, Function Evaluation

565 24.5 Piecewise Lookup Tables
To compute a function of a short (single) IEEE floating-point number: divide the 26-bit significand x (2 whole + 24 fractional bits) into 4 sections: x = t + λu + λ²v + λ³w = t + 2^–6 u + 2^–12 v + 2^–18 w, where u, v, w are 6-bit fractions in [0, 1) and t, with up to 8 bits, is in [0, 4). The Taylor polynomial for f(x): f(x) = Σ_{i=0 to ∞} f^(i)(t + λu) (λ²v + λ³w)^i / i!. Ignoring terms smaller than λ⁵ = 2^–30: f(x) ≈ f(t + λu) + (λ/2)[f(t + λu + λv) – f(t + λu – λv)] + (λ²/2)[f(t + λu + λw) – f(t + λu – λw)] + λ⁴[(v²/2) f″(t) – (v³/6) f‴(t)]. Use 4 additions to form these terms, read 5 values of f from tables, read the last term from a table, and perform a 6-operand addition. May 2012 Computer Arithmetic, Function Evaluation

566 Modular Reduction, or Computing z mod p
Divide the argument z into a (b – g)-bit upper part (x) and a g-bit lower part (y), where x ends with g zeros (x + y) mod p = (x mod p + y mod p) mod p Fig. 24.8a Two-table modular reduction scheme based on divide-and-conquer. May 2012 Computer Arithmetic, Function Evaluation
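A table-level sketch of the divide-and-conquer scheme (illustrative parameter values; the first table reduces the upper part, and a second table folds in the small remaining sum):

```python
P = 13          # check modulus (example value)
B, G = 16, 8    # argument width and split point (example values)

UPPER = [(h << G) % P for h in range(1 << (B - G))]   # x mod P, 2^(B-G) entries
FINAL = [v % P for v in range(P + (1 << G))]          # reduces (x mod P) + y

def mod_two_tables(z):
    """z mod P via two lookups: x = upper bits (ends in G zeros), y = low bits."""
    return FINAL[UPPER[z >> G] + (z & ((1 << G) - 1))]

print(mod_two_tables(60000), 60000 % P)  # both 5
```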

567 Another Two-Table Modular Reduction Scheme
Divide the argument z into a (b – h)-bit upper part (x) and an h-bit lower part (y), where x ends with h zeros Explanation to be added Fig. 24.8b Modular reduction based on successive refinement. May 2012 Computer Arithmetic, Function Evaluation

568 24.6 Multipartite Table Methods
24.6 Multipartite Table Methods. The k-bit input x is divided into three fields x0, x1, x2 (of a, b, and k – a – b bits); a "u table" indexed by (x0, x1) supplies an initial value u(x0, x1), a "v table" indexed by (x0, x2) supplies a correction v(x0, x2), and one addition yields the output y ≈ f(x). Divide the domain of interest into 2^a intervals, each of which is further divided into 2^b smaller subintervals. The trick: use linear interpolation with an initial value determined for each subinterval and a common slope for each larger interval. Fig. The bipartite table method: (a) hardware realization; (b) linear approximation with a common slope per interval. Total table size is 2^(a+b) + 2^(k–b), in lieu of 2^k; the width of table entries has been ignored in this comparison. May 2012 Computer Arithmetic, Function Evaluation
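A bipartite table built in Python for log2 on [1, 2) (a 12-bit input split 4 + 4 + 4; one initial-value table per subinterval, one slope-correction table per interval; parameter choices are illustrative):

```python
import math

K, A, B = 12, 4, 4
C = K - A - B
f = math.log2

u, v = {}, {}
for x0 in range(1 << A):
    start = 1 + x0 / (1 << A)                            # interval start
    slope = (f(start + 2.0 ** -A) - f(start)) * (1 << A) # common slope
    for x1 in range(1 << B):
        u[x0, x1] = f(start + x1 / (1 << (A + B)))       # subinterval value
    for x2 in range(1 << C):
        v[x0, x2] = slope * x2 / (1 << K)                # correction term

def bipartite_eval(bits):
    x0 = bits >> (B + C)
    x1 = (bits >> C) & ((1 << B) - 1)
    x2 = bits & ((1 << C) - 1)
    return u[x0, x1] + v[x0, x2]          # two lookups, one addition

bits = 0b101101110010
print(bipartite_eval(bits), math.log2(1 + bits / (1 << K)))  # both ~0.7785
```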

569 Generalizing to Tripartite and Higher-Order Tables
Two-part tables have been generalized to multipart (3-part, 4-part, . . .) tables May 2012 Computer Arithmetic, Function Evaluation

570 Part VII Implementation Topics
28. Reconfigurable Arithmetic Appendix: Past, Present, and Future May 2010 Computer Arithmetic, Implementation Topics

571 About This Presentation
This presentation is intended to support the use of the textbook Computer Arithmetic: Algorithms and Hardware Designs (Oxford U. Press, 2nd ed., 2010, ISBN ). It is updated regularly by the author as part of his teaching of the graduate course ECE 252B, Computer Arithmetic, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Unauthorized uses are strictly prohibited. © Behrooz Parhami Edition Released Revised First Jan. 2000 Sep. 2001 Sep. 2003 Oct. 2005 Dec. 2007 Second May 2010 May 2010 Computer Arithmetic, Implementation Topics

572 VII Implementation Topics
Sample advanced implementation methods and tradeoffs Speed / latency is seldom the only concern We also care about throughput, size, power/energy Fault-induced errors are different from arithmetic errors Implementation on programmable logic devices Topics in This Part Chapter 25 High-Throughput Arithmetic Chapter 26 Low-Power Arithmetic Chapter 27 Fault-Tolerant Arithmetic Chapter 28 Reconfigurable Arithmetic May 2010 Computer Arithmetic, Implementation Topics


574 25 High-Throughput Arithmetic
Chapter Goals Learn how to improve the performance of an arithmetic unit via higher throughput rather than reduced latency Chapter Highlights To improve overall performance, one must  Look beyond individual operations  Trade off latency for throughput For example, a multiply may take 20 cycles, but a new one can begin every cycle Data availability and hazards limit the depth May 2010 Computer Arithmetic, Implementation Topics

575 High-Throughput Arithmetic: Topics
Topics in This Chapter 25.1 Pipelining of Arithmetic Functions 25.2 Clock Rate and Throughput 25.3 The Earle Latch 25.4 Parallel and Digit-Serial Pipelines 25.5 On-Line or Digit-Pipelined Arithmetic 25.6 Systolic Arithmetic Units May 2010 Computer Arithmetic, Implementation Topics

576 25.1 Pipelining of Arithmetic Functions
Fig. An arithmetic function unit and its s-stage pipelined version. Throughput: operations per unit time. Pipelining period: interval between applying successive inputs. Latency, though a secondary consideration, is still important because: (a) occasional need for doing single operations; (b) dependencies may lead to bubbles or even drainage. At times, a pipelined implementation may improve the latency of a multistep computation and also reduce its cost; in such cases, the advantage is obvious. May 2010 Computer Arithmetic, Implementation Topics

577 Analysis of Pipelining Throughput
Consider a circuit with cost (gate count) g and latency t. Simplifying assumptions for our analysis: 1. Time overhead per stage is τ (latching delay). 2. Cost overhead per stage is γ (latching cost). 3. The function is divisible into s equal stages for any s. Then, for the pipelined implementation: Latency T = t + sτ. Throughput R = 1/(T/s) = 1/(t/s + τ). Cost G = g + sγ. Throughput approaches its maximum of 1/τ for large s. Fig. 25.1 May 2010 Computer Arithmetic, Implementation Topics

578 Analysis of Pipelining Cost-Effectiveness
Latency T = t + sτ; throughput R = 1/(T/s) = 1/(t/s + τ); cost G = g + sγ. Consider cost-effectiveness to be throughput per unit cost: E = R/G = s/[(t + sτ)(g + sγ)]. To maximize E, compute dE/ds and equate the numerator with 0: tg – s²τγ = 0 ⟹ s_opt = √(tg/(τγ)). We see that the most cost-effective number of pipeline stages is: directly related to the latency and cost of the function (it pays to have many stages if the function is very slow or complex); inversely related to pipelining delay and cost overheads (few stages are in order if pipelining overheads are fairly high). All in all, not a surprising result! May 2010 Computer Arithmetic, Implementation Topics
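Plugging in numbers (illustrative values for a 20-ns, 1000-gate function with 1-ns, 50-gate latches):

```python
from math import sqrt

def optimal_stages(t, g, tau, gamma):
    """s_opt = sqrt(t*g/(tau*gamma)); also returns throughput per unit cost."""
    s = sqrt(t * g / (tau * gamma))
    R = 1 / (t / s + tau)              # throughput at s_opt
    G = g + s * gamma                  # cost at s_opt
    return s, R / G

print(optimal_stages(t=20, g=1000, tau=1, gamma=50))  # s_opt = 20.0
```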

579 25.2 Clock Rate and Throughput
Consider an s-stage pipeline with stage delay tstage. One set of inputs is applied to the pipeline at time t1. At time t1 + tstage + τ, partial results are safely stored in latches. Apply the next set of inputs at time t2 satisfying t2 ≥ t1 + tstage + τ. Therefore: clock period = Δt = t2 – t1 ≥ tstage + τ, and throughput = 1/(clock period) ≤ 1/(tstage + τ). Fig. 25.1 May 2010 Computer Arithmetic, Implementation Topics

580 The Effect of Clock Skew on Pipeline Throughput
Two implicit assumptions in deriving the throughput equation below: one clock signal is distributed to all circuit elements; all latches are clocked at precisely the same time. Throughput = 1/(clock period) ≤ 1/(tstage + τ). Fig. 25.1. Uncontrolled or random clock skew causes the clock signal to arrive at point B before/after its arrival at point A. With proper design, we can place a bound ±ε on the uncontrolled clock skew at the input and output latches of a pipeline stage. Then, the clock period is lower-bounded as: clock period = Δt = t2 – t1 ≥ tstage + τ + 2ε. May 2010 Computer Arithmetic, Implementation Topics

581 Wave Pipelining: The Idea
The stage delay tstage is really not a constant but varies from tmin to tmax: tmin represents fast paths (with fewer or faster gates); tmax represents slow paths. Suppose that one set of inputs is applied at time t1. At time t1 + tmax + τ, the results are safely stored in latches. If the next inputs are applied at time t2, we must have: t2 + tmin ≥ t1 + tmax + τ. This places a lower bound on the clock period: clock period = Δt = t2 – t1 ≥ tmax – tmin + τ. Thus, we can approach the maximum possible pipeline throughput of 1/τ without necessarily requiring very small stage delay. All we need is a very small delay variance tmax – tmin. Two roads to higher pipeline throughput: reducing tmax; increasing tmin. May 2010 Computer Arithmetic, Implementation Topics

582 Visualizing Wave Pipelining
Fig. Wave pipelining allows multiple computational wavefronts to coexist in a single pipeline stage. May 2010 Computer Arithmetic, Implementation Topics

583 Another Visualization of Wave Pipelining
(a) Ordinary pipelining (b) Wave pipelining Transient region (shaded) Stationary (unshaded) Fig Alternate view of the throughput advantage of wave pipelining over ordinary pipelining. May 2010 Computer Arithmetic, Implementation Topics

584 Difficulties in Applying Wave Pipelining
LAN and other high-speed links (figures rounded from Myrinet data [Bode95]): Gb/s throughput ⟹ clock rate ≈ 10⁸/s ⟹ clock cycle = 10 ns. In 10 ns, signals travel 3 m (speed of light = 0.3 m/ns), so for a 30 m cable, some 10 characters will be in flight at the same time. At the circuit and logic level (μm-mm distances, not m), there are still problems to be worked out. For example, delay equalization to reduce tmax – tmin is nearly impossible in CMOS technology: CMOS 2-input NAND delay varies by a factor of 2 based on inputs; biased CMOS (pseudo-CMOS) fares better, but has a power penalty. May 2010 Computer Arithmetic, Implementation Topics

585 Controlled Clock Skew in Wave Pipelining
With wave pipelining, a new input enters the pipeline stage every Δt time units and the stage latency is tmax + τ. Thus, for proper sampling of the results, clock application at the output latch must be skewed by (tmax + τ) mod Δt. Example: tmax + τ = 12 ns; Δt = 5 ns. A clock skew of +2 ns is required at the stage output latches relative to the input latches. In general, the value of tmax – tmin > 0 may be different for each stage: Δt ≥ max_{i=1 to s} [tmax(i) – tmin(i) + τ]. The controlled clock skew at the output of stage i needs to be: S(i) = { Σ_{j=1 to i} [tmax(j) – tmin(j) + τ] } mod Δt. May 2010 Computer Arithmetic, Implementation Topics

586 Random Clock Skew in Wave Pipelining
Clock period = t = t2 – t1  tmax – tmin + t + 4e Reasons for the term 4e: Clocking of the first input set may lag by e, while that of the second set leads by e (net difference = 2e) The reverse condition may exist at the output side Uncontrolled skew has a larger effect on wave pipelining than on standard pipelining, especially when viewed in relative terms Graphical justification of the term 4e May 2010 Computer Arithmetic, Implementation Topics

587 Computer Arithmetic, Implementation Topics
25.3 The Earle Latch. The Earle latch can be merged with a preceding two-level AND-OR logic circuit. Fig. Two-level AND-OR realization of the Earle latch: z = dC + dz + C̄z. Example: to latch d = vw + xy, substitute for d in the latch equation z = dC + dz + C̄z, obtaining a combined "logic + latch" circuit implementing z = vw + xy: z = (vw + xy)C + (vw + xy)z + C̄z = vwC + xyC + vwz + xyz + C̄z. Fig. Two-level AND-OR latched realization of the function z = vw + xy. May 2010 Computer Arithmetic, Implementation Topics

588 Clocking Considerations for Earle Latches
We derived constraints on the maximum clock rate 1/Δt. The clock period Δt has two parts: clock high and clock low: Δt = Chigh + Clow. Consider a pipeline stage between Earle latches. Chigh must satisfy the inequalities 3δmax – δmin + Smax(C, C̄) ≤ Chigh ≤ 2δmin + tmin, where δmax and δmin are the maximum and minimum gate delays, and Smax(C, C̄) ≥ 0 is the maximum skew between C and C̄. May 2010 Computer Arithmetic, Implementation Topics

589 25.4 Parallel and Digit-Serial Pipelines
Fig Flow-graph representation of an arithmetic expression and timing diagram for its evaluation with digit-parallel computation. May 2010 Computer Arithmetic, Implementation Topics

590 Feasibility of Bit-Level or Digit-Level Pipelining
Bit-serial addition and multiplication can be done LSB-first, but division and square-rooting are MSB-first operations. Besides, division can't be done in pipelined bit-serial fashion, because the MSB of the quotient q in general depends on all the bits of the dividend and divisor. Example: consider the decimal division .1234/.2469: .1xxx/.2xxx = .?xxx; .12xx/.24xx = .?xxx; .123x/.246x = .?xxx. Solution: redundant number representation! May 2010 Computer Arithmetic, Implementation Topics

591 25.5 On-Line or Digit-Pipelined Arithmetic
Fig Digit-parallel versus digit-pipelined computation. May 2010 Computer Arithmetic, Implementation Topics

592 Digit-Pipelined Adders
Fig. 25.8 Digit-pipelined MSD-first carry-free addition. Fig. 25.9 Digit-pipelined MSD-first limited-carry addition. May 2010 Computer Arithmetic, Implementation Topics

593 Digit-Pipelined Multiplier: Algorithm Visualization
Fig Digit-pipelined MSD-first multiplication process. May 2010 Computer Arithmetic, Implementation Topics

594 Digit-Pipelined Multiplier: BSD Implementation
Fig Digit-pipelined MSD-first BSD multiplier. May 2010 Computer Arithmetic, Implementation Topics

595 Digit-Pipelined Divider
Table Example of digit-pipelined division showing that three cycles of delay are necessary before quotient digits can be output (radix = 4, digit set = [–2, 2]) –––––––––––––––––––––––––––––––––––––––––––––––––––––––––– Cycle Dividend Divisor q Range q–1 Range 1 ( )four ( )four (–2/3, 2/3) [–2, 2] 2 ( )four (.1– )four (–2/4, 2/4) [–2, 2] 3 ( )four (.1–2– )four (1/16, 5/16) [0, 1] 4 ( )four (.1–2–2– )four (10/64, 14/64) 1 May 2010 Computer Arithmetic, Implementation Topics

596 Digit-Pipelined Square-Rooter
Table Examples of digit-pipelined square-root computation showing that 1-2 cycles of delay are necessary before root digits can be output (radix = 10, digit set = [–6, 6], and radix = 2, digit set = [–1, 1]) ––––––––––––––––––––––––––––––––––––––––––––––––––––––– Cycle Radicand q Range q–1 Range 1 ( )ten ( 7/30 ,  11/30 ) [5, 6] 2 ( )ten ( 1/3 ,  26/75 ) 6 1 ( )two (0,  1/2 ) [–2, 2] 2 ( )two (0,  1/2 ) [0, 1] 3 ( )two (1/2,  1/2 ) 1 May 2010 Computer Arithmetic, Implementation Topics

597 Digit-Pipelined Arithmetic: The Big Picture
Fig. Conceptual view of on-line, or digit-pipelined, arithmetic: processed input parts feed the on-line arithmetic unit, which maintains a residual; output digits already produced stream out while unprocessed input parts are still arriving. May 2010 Computer Arithmetic, Implementation Topics

598 25.6 Systolic Arithmetic Units
Systolic arrays: cellular circuits in which data elements enter at the boundaries, advance from cell to cell in lock step, are transformed in an incremental fashion, and leave from the boundaries. Systolic design mitigates the effect of signal propagation delay and allows the use of very high clock rates. Fig. High-level design of a systolic radix-4 digit-pipelined multiplier. May 2010 Computer Arithmetic, Implementation Topics

599 Case Study: Systolic Programmable FIR Filters
(a) Conventional: Broadcast control, broadcast data (b) Systolic: Pipelined control, pipelined data Fig Conventional and systolic realizations of a programmable FIR filter. May 2010 Computer Arithmetic, Implementation Topics

600 Computer Arithmetic, Implementation Topics
26 Low-Power Arithmetic Chapter Goals Learn how to improve the power efficiency of arithmetic circuits by means of algorithmic and logic design strategies Chapter Highlights Reduced power dissipation needed due to  Limited source (portable, embedded)  Difficulty of heat disposal Algorithm and logic-level methods: discussed Technology and circuit methods: ignored here May 2010 Computer Arithmetic, Implementation Topics

601 Low-Power Arithmetic: Topics
Topics in This Chapter 26.1 The Need for Low-Power Design 26.2 Sources of Power Consumption 26.3 Reduction of Power Waste 26.4 Reduction of Activity 26.5 Transformations and Tradeoffs 26.6 New and Emerging Methods May 2010 Computer Arithmetic, Implementation Topics

602 26.1 The Need for Low-Power Design
Portable and wearable electronic devices: lithium-ion batteries provide about 0.2 watt-hr per gram of weight; practical battery weight < 500 g (< 50 g if wearable device); so total power must be around 5-10 watts for a day's work between recharges. Modern high-performance microprocessors use hundreds of watts. Power is proportional to die area × clock frequency. Cooling of micros is difficult, but still manageable; cooling of MPPs and server farms is a BIG challenge. New battery technologies cannot keep pace with demand. Demand for more speed and functionality (multimedia, etc.) continues. May 2010 Computer Arithmetic, Implementation Topics

603 Processor Power Consumption Trends
The factor-of-100 improvement per decade in energy efficiency has been maintained since 2000. Fig. Power consumption trend in DSPs [Raba98], plotting power consumption per MIPS (W) over time. May 2010 Computer Arithmetic, Implementation Topics

604 26.2 Sources of Power Consumption
Both average and peak power are important: average power determines battery life or heat dissipation; peak power impacts power distribution and signal integrity. Typically, low-power design aims at reducing both. Power dissipation in CMOS digital circuits: Static: leakage current in imperfect switches (< 10%). Dynamic: due to (dis)charging of parasitic capacitance: Pavg ≈ α f C V², where α is the activity (fraction of nodes switching), f the data rate (clock frequency), C the capacitance, and V the supply voltage. May 2010 Computer Arithmetic, Implementation Topics

605 Power Reduction Strategies: The Big Picture
For a given data rate f, there are but 3 ways to reduce the power requirements: 1. Using a lower supply voltage V. 2. Reducing the parasitic capacitance C. 3. Lowering the switching activity α. Pavg ≈ α f C V². Example: a 32-bit off-chip bus operates at 5 V and 100 MHz and drives a capacitance of 30 pF per bit. If random values were put on the bus in every cycle, we would have α = 0.5. To account for data correlation and idle bus cycles, assume α = 0.2. Then: Pavg ≈ α f C V² = 0.2 × 10⁸ × (32 × 30 × 10⁻¹²) × 5² = 0.48 W. May 2010 Computer Arithmetic, Implementation Topics
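The same estimate as a one-liner in Python:

```python
def average_power(alpha, f_hz, c_farads, v_volts):
    """Dynamic CMOS power: P_avg ~ alpha * f * C * V^2."""
    return alpha * f_hz * c_farads * v_volts ** 2

# The off-chip bus example above: alpha = 0.2, 100 MHz, 32 x 30 pF, 5 V
print(average_power(0.2, 100e6, 32 * 30e-12, 5.0))  # 0.48 (watts)
```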

606 26.3 Reduction of Power Waste
Fig Saving power through clock gating. Fig Saving power via guarded evaluation. May 2010 Computer Arithmetic, Implementation Topics

607 Glitching and Its Impact on Power Waste
Fig Example of glitching in a ripple-carry adder. May 2010 Computer Arithmetic, Implementation Topics

608 Array Multipliers with Lower Power Consumption
Fig An array multiplier with gated FA cells. May 2010 Computer Arithmetic, Implementation Topics

609 Computer Arithmetic, Implementation Topics
26.4 Reduction of Activity Fig Reduction of activity by precomputation. Fig Reduction of activity via Shannon expansion. May 2010 Computer Arithmetic, Implementation Topics

610 26.5 Transformations and Tradeoffs
Fig Reduction of power via parallelism or pipelining. May 2010 Computer Arithmetic, Implementation Topics

611 Unrolling of Iterative Computations
(a) Simple (b) Unrolled once Fig Realization of a first-order IIR filter. May 2010 Computer Arithmetic, Implementation Topics

612 Retiming for Power Efficiency
(a) Original (b) Retimed Fig Possible realizations of a fourth-order FIR filter. May 2010 Computer Arithmetic, Implementation Topics

613 26.6 New and Emerging Methods
Dual-rail data encoding with transition signaling: Two wires per signal Transition on wire 0 (1) indicates the arrival of 0 (1) Dual-rail design does increase the wiring density, but it offers the advantage of complete insensitivity to delays Fig Part of an asynchronous chain of computations. May 2010 Computer Arithmetic, Implementation Topics

614 The Ultimate in Low-Power Design
Fig. Some reversible logic gates: (a) Toffoli gate TG: P = A, Q = B, R = AB ⊕ C; (b) Fredkin gate FRG: P = A, Q = ĀB ∨ AC, R = AB ∨ ĀC; (c) Feynman gate FG: P = A, Q = A ⊕ B; (d) Peres gate PG: P = A, Q = A ⊕ B, R = AB ⊕ C. Fig. Reversible binary full adder built of 5 Fredkin gates, with a single Feynman gate used to fan out the input B. The label "G" denotes "garbage." May 2010 Computer Arithmetic, Implementation Topics

615 27 Fault-Tolerant Arithmetic
Chapter Goals Learn about errors due to hardware faults or hostile environmental conditions, and how to deal with or circumvent them Chapter Highlights Modern components are very robust, but . . . put millions / billions of them together and something is bound to go wrong Can arithmetic be protected via encoding? Reliable circuits and robust algorithms May 2010 Computer Arithmetic, Implementation Topics

616 Fault-Tolerant Arithmetic: Topics
Topics in This Chapter 27.1 Faults, Errors, and Error Codes 27.2 Arithmetic Error-Detecting Codes 27.3 Arithmetic Error-Correcting Codes 27.4 Self-Checking Function Units 27.5 Algorithm-Based Fault Tolerance 27.6 Fault-Tolerant RNS Arithmetic May 2010 Computer Arithmetic, Implementation Topics

617 27.1 Faults, Errors, and Error Codes
Fig. 27.1 A common way of applying information coding techniques. May 2010 Computer Arithmetic, Implementation Topics

618 Fault Detection and Fault Masking
(a) Duplication and comparison (b) Triplication and voting Fig Arithmetic fault detection or fault tolerance (masking) with replicated units. May 2010 Computer Arithmetic, Implementation Topics

619 Inadequacy of Standard Error Coding Methods
Unsigned addition: a stage generating an erroneous carry of 1 turns the correct sum into an erroneous sum. Fig. How a single carry error can produce an arbitrary number of bit-errors (inversions). The arithmetic weight of an error: the minimum number of signed powers of 2 that must be added to the correct value to produce the erroneous result. Example 1: difference (error) = 16 = 2⁴; minimum-weight BSD form has one nonzero digit; arithmetic weight = 1; error type: single, positive. Example 2: difference (error) = –32752 = –2¹⁵ + 2⁴; minimum-weight BSD form has two nonzero digits; arithmetic weight = 2; error type: double, negative. May 2010 Computer Arithmetic, Implementation Topics

620 27.2 Arithmetic Error-Detecting Codes
Are characterized by arithmetic weights of detectable errors Allow direct arithmetic on coded operands We will discuss two classes of arithmetic error-detecting codes, both of which are based on a check modulus A (usually a small odd number) Product or AN codes Represent the value N by the number AN Residue (or inverse residue) codes Represent the value N by the pair (N, C), where C is N mod A or (N – N mod A) mod A May 2010 Computer Arithmetic, Implementation Topics

621 Computer Arithmetic, Implementation Topics
Product or AN Codes. For odd A, all weight-1 arithmetic errors are detected. Arithmetic errors of weight ≥ 2 may go undetected; e.g., the error 32,736 = 2¹⁵ – 2⁵ is undetectable with A = 3, 11, or 31. Error detection: check divisibility by A. Encoding/decoding: multiply/divide by A. Arithmetic also requires multiplication and division by A. Product codes are nonseparate (nonseparable) codes: data and redundant check info are intermixed. May 2010 Computer Arithmetic, Implementation Topics

622 Low-Cost Product Codes
Low-cost product codes use low-cost check moduli of the form A = 2^a – 1. Multiplication by A = 2^a – 1 is done by shift-subtract: (2^a – 1)x = 2^a x – x. Division by A = 2^a – 1 is done a bits at a time, as follows: given y = (2^a – 1)x, find x by computing 2^a x – y; since 2^a x is just x shifted, the unknown x satisfies x = 2^a x – y, and its digits can be obtained a bits at a time, starting from the LSB. Theorem 27.1: any unidirectional error with arithmetic weight of at most a – 1 is detectable by a low-cost product code based on A = 2^a – 1. May 2010 Computer Arithmetic, Implementation Topics
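The chunkwise division in code (exact division is assumed, i.e., y really is a multiple of 2^a – 1; chunk i of x is obtained from chunk i – 1 with borrow propagation, per x = 2^a x – y):

```python
def divide_by_low_cost_modulus(y, a):
    """Exact division y / (2^a - 1), computed a bits at a time, LSB first."""
    mask = (1 << a) - 1
    x, prev_chunk, borrow = 0, 0, 0
    for i in range(y.bit_length() // a + 2):
        diff = prev_chunk - ((y >> (i * a)) & mask) - borrow
        borrow = 1 if diff < 0 else 0
        prev_chunk = diff & mask       # chunk i of x
        x |= prev_chunk << (i * a)
    return x

print(divide_by_low_cost_modulus(15 * 1234, 4))  # 1234  (A = 2^4 - 1 = 15)
```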

623 Arithmetic on AN-Coded Operands
Add/subtract is done directly: Ax ± Ay = A(x ± y). Direct multiplication results in: Aa × Ax = A²ax; the result must be corrected through division by A. For division, if z = qd + s, we have: Az = q(Ad) + As; thus, q is unprotected. Possible cure: premultiply the dividend Az by A; the result will need correction. Square-rooting leads to a problem similar to division: ⌊√(A²x)⌋ = ⌊A√x⌋, which is not the same as A⌊√x⌋. May 2010 Computer Arithmetic, Implementation Topics

624 Residue and Inverse Residue Codes
Represent N by the pair (N, C(N)), where C(N) = N mod A Residue codes are separate (separable) codes Separate data and check parts make decoding trivial Encoding: given N, compute C(N) = N mod A Low-cost residue codes use A = 2a – 1 Arithmetic on residue-coded operands Add/subtract: data and check parts are handled separately (x, C(x))  (y, C(y)) = (x  y, (C(x)  C(y)) mod A) Multiply (a, C(a))  (x, C(x)) = (a  x, (C(a)C(x)) mod A) Divide/square-root: difficult May 2010 Computer Arithmetic, Implementation Topics
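A small model of separate residue-coded arithmetic and its check (a low-cost modulus, chosen here as 15; errors that happen to be multiples of A escape detection, as the theory predicts):

```python
A = 15                      # low-cost check modulus 2^4 - 1

def encode(n):              # residue-coded operand (N, C(N))
    return (n, n % A)

def add(p, q):
    return (p[0] + q[0], (p[1] + q[1]) % A)

def mul(p, q):
    return (p[0] * q[0], (p[1] * q[1]) % A)

def check(p):               # data and check parts must agree mod A
    return p[0] % A == p[1]

z = add(mul(encode(1234), encode(5678)), encode(999))
print(check(z))                          # True: fault-free
z_faulty = (z[0] ^ 0x40, z[1])           # flip one data bit (error = 64)
print(check(z_faulty))                   # False: 64 mod 15 != 0, detected
```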

625 Arithmetic on Residue-Coded Operands
Add/subtract: Data and check parts are handled separately (x, C(x))  (y, C(y)) = (x  y, (C(x)  C(y)) mod A) Multiply (a, C(a))  (x, C(x)) = (a  x, (C(a)C(x)) mod A) Divide/square-root: difficult Fig. 27.4 Arithmetic processor with residue checking. May 2010 Computer Arithmetic, Implementation Topics

626 Example: Residue Checked Adder
May 2010 Computer Arithmetic, Implementation Topics

627 27.3 Arithmetic Error-Correcting Codes
Table 27.1: Error syndromes for weight-1 arithmetic errors in the (7, 15) biresidue code.
Positive error: syndrome mod 7, mod 15 | Negative error: syndrome mod 7, mod 15
1: 1, 1 | –1: 6, 14
2: 2, 2 | –2: 5, 13
4: 4, 4 | –4: 3, 11
8: 1, 8 | –8: 6, 7
16, –16, 32, –32, …: similarly
Because all the syndromes in this table are different, any weight-1 arithmetic error is correctable by the (mod 7, mod 15) biresidue code. May 2010 Computer Arithmetic, Implementation Topics

628 Properties of Biresidue Codes
Biresidue code with relatively prime low-cost check moduli A = 2a – 1 and B = 2b – 1 supports a  b bits of data for weight-1 error correction Representational redundancy = (a + b)/(ab) = 1/a + 1/b May 2010 Computer Arithmetic, Implementation Topics

629 27.4 Self-Checking Function Units
Self-checking (SC) unit: any fault from a prescribed set does not affect the correct output (masked) or leads to a noncodeword output (detected) An invalid result is: Detected immediately by a code checker, or Propagated downstream by the next self-checking unit To build SC units, we need SC code checkers that never validate a noncodeword, even when they are faulty May 2010 Computer Arithmetic, Implementation Topics

630 Design of a Self-Checking Code Checker
Example: SC checker for an inverse residue code (N, C′(N)): N mod A should be the bitwise complement of C′(N). Verifying that signal pairs (xi, yi) are all (1, 0) or (0, 1) is the same as finding the AND of Boolean values encoded as: 1: (1, 0) or (0, 1); 0: (0, 0) or (1, 1). Fig. Two-input AND circuit, with 2-bit inputs (x0, y0) and (x1, y1), for use in a self-checking code checker. May 2010 Computer Arithmetic, Implementation Topics

631 Case Study: Self-Checking Adders
(a) Parity prediction (b) Parity preservation Fig. 27.6 Self-checking adders with parity-encoded inputs and output. P/R = Parity-to-redundant converter R/P = Redundant-to-parity converter May 2010 Computer Arithmetic, Implementation Topics

632 27.5 Algorithm-Based Fault Tolerance
Alternative strategy to error detection after each basic operation: accept that operations may yield incorrect results; detect/correct errors at the data-structure or application level. Example: multiplication of matrices X and Y yielding P. Row, column, and full checksum matrices (mod 8). Fig. A 3 × 3 matrix M with its row, column, and full checksum matrices Mr, Mc, and Mf. May 2010 Computer Arithmetic, Implementation Topics

633 Properties of Checksum Matrices
Theorem 27.3: if P = X × Y, we have Pf = Xc × Yr (with floating-point values, the equalities are approximate). Fig. 27.7. Theorem 27.4: in a full-checksum matrix, any single erroneous element can be corrected, and any three errors can be detected. May 2010 Computer Arithmetic, Implementation Topics
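Theorem 27.3 checked numerically (a sketch using NumPy, with mod-8 arithmetic as in the example above):

```python
import numpy as np

MOD = 8

def column_checksum(m):     # append a row of column sums: Mc
    return np.vstack([m, m.sum(axis=0)]) % MOD

def row_checksum(m):        # append a column of row sums: Mr
    return np.hstack([m, m.sum(axis=1, keepdims=True)]) % MOD

X = np.array([[1, 2, 3], [4, 5, 6], [7, 0, 1]])
Y = np.array([[2, 1, 0], [3, 3, 1], [0, 2, 5]])

Pf = column_checksum(X) @ row_checksum(Y) % MOD   # Xc x Yr
P = X @ Y % MOD
print(np.array_equal(Pf[:3, :3], P))                   # True: data block
print(np.array_equal(Pf[3, :3], P.sum(axis=0) % MOD))  # True: column sums
print(np.array_equal(Pf[:3, 3], P.sum(axis=1) % MOD))  # True: row sums
```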

634 27.6 Fault-Tolerant RNS Arithmetic
Residue number systems allow very elegant and effective error detection and correction schemes by means of redundant residues (extra moduli). Example: RNS(8 | 7 | 5 | 3), dynamic range M = 8 × 7 × 5 × 3 = 840; redundant modulus: 11. Any error confined to a single residue is detectable. Error detection (the redundant modulus must be the largest one, say m): 1. Use the other residues to compute the residue of the number mod m (this process is known as base extension). 2. Compare the computed and actual mod-m residues. The beauty of this method is that arithmetic algorithms are completely unaffected; error detection is made possible by simply extending the dynamic range of the RNS. May 2010 Computer Arithmetic, Implementation Topics

635 Example RNS with two Redundant Residues
RNS(8 | 7 | 5 | 3), with redundant moduli 13 and 11. Representation of 25 = (12, 3, 1, 4, 0, 1)RNS. Corrupted version = (12, 3, 1, 6, 0, 1)RNS. Transform (–, –, 1, 6, 0, 1) to (5, 1, 1, 6, 0, 1) via base extension. Reconstructed number = (5, 1, 1, 6, 0, 1)RNS. The difference between the first two components of the corrupted and reconstructed numbers is (+7, +2). This constitutes a syndrome, allowing us to correct the error. May 2010 Computer Arithmetic, Implementation Topics
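The base-extension check in Python (Chinese remainder reconstruction stands in for the hardware base-extension step; a nonzero syndrome flags the error):

```python
from math import prod

MODULI = (13, 11, 8, 7, 5, 3)        # two redundant + four information moduli

def crt(residues, moduli):
    """Smallest nonnegative solution of the given congruences."""
    M = prod(moduli)
    return sum(r * (M // m) * pow(M // m, -1, m)
               for r, m in zip(residues, moduli)) % M

x = [25 % m for m in MODULI]         # (12, 3, 1, 4, 0, 1)
x[3] = 6                             # corrupt the mod-7 residue
recon = crt(x[2:], MODULI[2:])       # base extension from (8, 7, 5, 3): 265
syndrome = ((x[0] - recon) % 13, (x[1] - recon) % 11)
print(syndrome)                      # (7, 2): nonzero, so an error is flagged
```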

636 28 Reconfigurable Arithmetic
Chapter Goals Examine arithmetic algorithms and designs appropriate for implementation on FPGAs (one-of-a-kind, low-volume, prototype systems) Chapter Highlights Suitable adder designs beyond ripple-carry Design choices for multipliers and dividers Table-based and “distributed” arithmetic Techniques for function evaluation Enhanced FPGAs and higher-level alternatives May 2010 Computer Arithmetic, Implementation Topics

637 Reconfigurable Arithmetic: Topics
Topics in This Chapter 28.1 Programmable Logic Devices 28.2 Adder Designs for FPGAs 28.3 Multiplier and Divider Designs 28.4 Tabular and Distributed Arithmetic 28.5 Function Evaluation on FPGAs 28.6 Beyond Fine-Grained Devices May 2010 Computer Arithmetic, Implementation Topics

638 28.1 Programmable Logic Devices
LB I/O block Programmable interconnects Logic block (or LB cluster) Fig Examples of programmable sequential logic. May 2010 Computer Arithmetic, Implementation Topics

639 Programmability Mechanisms
Slide to be completed. Fig. Some memory-controlled switches and interconnections in programmable logic devices: (a) tristate buffer, (b) pass transistor, (c) multiplexer, each controlled by one or more memory cells. May 2010 Computer Arithmetic, Implementation Topics

640 Configurable Logic Blocks
Fig. Structure of a simple logic block: a small logic function or LUT with inputs x0-x4, a flip-flop, carry-in and carry-out, and outputs y0-y2. May 2010 Computer Arithmetic, Implementation Topics

641 The Interconnect Fabric
LB or cluster Vertical wiring channels Switch box Horizontal wiring channels Fig A possible arrangement of programmable interconnects between LBs or LB clusters. May 2010 Computer Arithmetic, Implementation Topics

642 Standard FPGA Design Flow
1. Specification: Creating the design files, typically via a hardware description language such as Verilog, VHDL, or Abel 2. Synthesis: Converting the design files into interconnected networks of gates and other standard logic circuit elements 3. Partitioning: Assigning the logic elements of stage 2 to specific physical circuit elements that are capable of realizing them 4. Placement: Mapping of the physical circuit elements of stage 3 to specific physical locations of the target FPGA device 5. Routing: Mapping of the interconnections prescribed in stage 2 to specific physical wires on the target FPGA device 6. Configuration: Generation of the requisite bit-stream file that holds configuration bits for the target FPGA device 7. Programming: Uploading the bit-stream file of stage 6 to memory elements within the FPGA device 8. Verification: Ensuring the correctness of the final design, in terms of both function and timing, via simulation and testing May 2010 Computer Arithmetic, Implementation Topics

643 28.2 Adder Designs for FPGAs
This slide to include a discussion of ripple-carry adders and built-in carry chains in FPGAs May 2010 Computer Arithmetic, Implementation Topics

644 Computer Arithmetic, Implementation Topics
Carry-Skip Addition. Slide to be completed. Fig. Possible design of a 16-bit carry-skip adder on an FPGA: adder blocks with skip logic between cin and cout. May 2010 Computer Arithmetic, Implementation Topics

645 Carry-Select Addition
Slide to be completed. Fig. Possible design of a carry-select adder on an FPGA, with blocks of 1, 2, 3, 4, and 6 bits. May 2010 Computer Arithmetic, Implementation Topics

646 28.3 Multiplier and Divider Designs
Slide to be completed. Fig. Divide-and-conquer 4 × 4 multiplier design using 4-input lookup tables and ripple-carry adders: the 2-bit sub-products of the a1a0/x1x0 halves come from LUTs and are combined by a 4-bit and a 6-bit adder into p7 … p0. May 2010 Computer Arithmetic, Implementation Topics
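The same divide-and-conquer idea in Python (a dictionary stands in for the 4-input LUTs holding 2-bit × 2-bit products):

```python
LUT = {(a, x): a * x for a in range(4) for x in range(4)}   # 2x2-bit products

def mul4x4(a, x):
    """p = a*x for 4-bit operands: four LUT sub-products plus two additions."""
    aH, aL = a >> 2, a & 3
    xH, xL = x >> 2, x & 3
    return (LUT[aH, xH] << 4) + ((LUT[aH, xL] + LUT[aL, xH]) << 2) + LUT[aL, xL]

print(mul4x4(13, 11))  # 143
```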

647 Multiplication by Constants
Slide to be completed. Fig. Multiplication of an 8-bit input by 13, using LUTs: x is split into nibbles xH and xL, tables supply 13xH and 13xL, and an 8-bit adder combines them into 13x. May 2010 Computer Arithmetic, Implementation Topics
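In code (one 16-entry table serves both nibbles, followed by a single shifted addition):

```python
TIMES13 = [13 * n for n in range(16)]        # LUT contents: 13 * nibble

def times13(x):
    """13 * x for 8-bit x: two LUT reads, one shifted addition."""
    return (TIMES13[x >> 4] << 4) + TIMES13[x & 0xF]

print(times13(200))  # 2600
```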

648 Computer Arithmetic, Implementation Topics
Division on FPGAs Slide to be completed May 2010 Computer Arithmetic, Implementation Topics

649 28.4 Tabular and Distributed Arithmetic
Slide to be completed May 2010 Computer Arithmetic, Implementation Topics

650 Second-Order Digital Filter: Definition
y(i) = a(0) x(i) + a(1) x(i–1) + a(2) x(i–2) – b(1) y(i–1) – b(2) y(i–2), where the a(j)s and b(j)s are constants; the current and two previous inputs and the two previous outputs are held in latches. Expand the equation for y(i) in terms of the bits of the operands x = (x0.x–1x–2 … x–l)2's-compl and y = (y0.y–1y–2 … y–l)2's-compl, where the summations range from j = –l to j = –1: y(i) = a(0)(–x0(i) + Σ 2^j xj(i)) + a(1)(–x0(i–1) + Σ 2^j xj(i–1)) + a(2)(–x0(i–2) + Σ 2^j xj(i–2)) – b(1)(–y0(i–1) + Σ 2^j yj(i–1)) – b(2)(–y0(i–2) + Σ 2^j yj(i–2)). Define f(s, t, u, v, w) = a(0)s + a(1)t + a(2)u – b(1)v – b(2)w. Then y(i) = Σ 2^j f(xj(i), xj(i–1), xj(i–2), yj(i–1), yj(i–2)) – f(x0(i), x0(i–1), x0(i–2), y0(i–1), y0(i–2)). May 2010 Computer Arithmetic, Implementation Topics

651 Second-Order Digital Filter: Bit-Serial Implementation
Fig. Bit-serial tabular realization of a second-order filter: a 32-entry lookup table holds all values of f; in each bit cycle, one bit each of the i-th input, the two previous inputs, and the two previous outputs address the table, and the i-th output is accumulated, with a copy made at the end of the cycle. May 2010 Computer Arithmetic, Implementation Topics
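A distributed-arithmetic check in Python: precompute all 32 values of f, then rebuild one output from the operands' bits (example coefficients; operand values chosen to be exactly representable):

```python
A = (0.75, 0.5, 0.25)        # a(0), a(1), a(2)
B = (0.5, 0.25)              # b(1), b(2)

F = [A[0]*s + A[1]*t + A[2]*u - B[0]*v - B[1]*w
     for s in (0, 1) for t in (0, 1) for u in (0, 1)
     for v in (0, 1) for w in (0, 1)]                 # the 32-entry table

def f_lookup(s, t, u, v, w):
    return F[(s << 4) | (t << 3) | (u << 2) | (v << 1) | w]

def to_bits(val, l):
    """2's-complement bits [x0, x-1, ..., x-l] of val in [-1, 1)."""
    n = round(val * (1 << l)) & ((1 << (l + 1)) - 1)
    return [(n >> (l - j)) & 1 for j in range(l + 1)]

L = 12
vals = (0.5, -0.25, 0.125, 0.25, -0.5)   # x(i), x(i-1), x(i-2), y(i-1), y(i-2)
bits = [to_bits(v, L) for v in vals]
y = -f_lookup(*(b[0] for b in bits)) + sum(
        2.0 ** -j * f_lookup(*(b[j] for b in bits)) for j in range(1, L + 1))
direct = sum(c * v for c, v in zip((*A, -B[0], -B[1]), vals))
print(abs(y - direct) < 1e-12)   # True
```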

652 28.5 Function Evaluation on FPGAs
Slide to be completed. Fig. The first four stages of an unrolled CORDIC processor: each stage holds an add/sub unit per coordinate, hardwired shifts (>> 1, >> 2, >> 3) of the companion variable, sign logic, and the angle constants e(0)-e(3). May 2010 Computer Arithmetic, Implementation Topics
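A software model of the unrolled rotation-mode pipeline (each loop iteration corresponds to one hardware stage; the e(i) constants and the aggregate gain are precomputed):

```python
import math

ANGLES = [math.atan(2.0 ** -i) for i in range(16)]    # the e(i) stage constants
GAIN = math.prod(math.cos(a) for a in ANGLES)         # ~0.60725

def cordic_cos_sin(z):
    """16 unrolled CORDIC stages: returns (cos z, sin z) for |z| < ~1.74."""
    x, y = GAIN, 0.0                 # pre-scaling absorbs the CORDIC gain
    for i, e in enumerate(ANGLES):
        d = 1 if z >= 0 else -1      # the per-stage sign logic
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * e
    return x, y

print(cordic_cos_sin(1.0))  # ~(0.5403, 0.8415)
```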

653 Implementing Convergence Schemes
Slide to be completed Lookup table x Convergence step y(0)  f(x) y(1) y(2) Fig Generic convergence structure for function evaluation. May 2010 Computer Arithmetic, Implementation Topics

654 28.6 Beyond Fine-Grained Devices
Slide to be completed. Fig. The design space for arithmetic-intensive applications, spanning general-purpose processors, special-purpose processors, and field-programmable arithmetic arrays. May 2010 Computer Arithmetic, Implementation Topics

655 A Past, Present, and Future
Appendix Goals Wrap things up, provide perspective, and examine arithmetic in a few key systems Appendix Highlights One must look at arithmetic in context of  Computational requirements  Technological constraints  Overall system design goals  Past and future developments Current trends and research directions? May 2010 Computer Arithmetic, Implementation Topics

656 Past, Present, and Future: Topics
Topics in This Chapter A.1 Historical Perspective A.2 Early High-Performance Computers A.3 Deeply Pipelined Vector Machines A.4 The DSP Revolution A.5 Supercomputers on Our Laps A.6 Trends, Outlook, and Resources May 2010 Computer Arithmetic, Implementation Topics

657 A.1 Historical Perspective
Babbage was aware of ideas such as carry-skip addition, carry-save addition, and restoring division Modern reconstruction from Meccano parts; 1848 May 2010 Computer Arithmetic, Implementation Topics

658 Computer Arithmetic in the 1940s
Machine arithmetic was crucial in proving the feasibility of computing with stored-program electronic devices. Hardware for addition/subtraction, use of complement representation, and shift-add multiplication and division algorithms were developed and fine-tuned. A seminal report by A. W. Burks, H. H. Goldstine, and J. von Neumann contained ideas on choice of number radix, carry propagation chains, fast multiplication via carry-save addition, and restoring division. State of computer arithmetic circa 1950: overview paper by R. F. Shaw [Shaw50]. May 2010 Computer Arithmetic, Implementation Topics

659 Computer Arithmetic in the 1950s
The focus shifted from feasibility to algorithmic speedup methods and cost-effective hardware realizations By the end of the decade, virtually all important fast-adder designs had already been published or were in the final phases of development Residue arithmetic, SRT division, CORDIC algorithms were proposed and implemented Snapshot of the field circa 1960: Overview paper by O.L. MacSorley [MacS61] May 2010 Computer Arithmetic, Implementation Topics

660 Computer Arithmetic in the 1960s
Tree multipliers, array multipliers, high-radix dividers, convergence division, and redundant signed-digit arithmetic were introduced. Implementation of floating-point arithmetic operations in hardware or firmware (in microprogram) became prevalent. Many innovative ideas originated from the design of early supercomputers, when the demand for high performance, along with the still high cost of hardware, led designers to novel and cost-effective solutions. Examples reflecting the state of the art near the end of this decade: IBM's System/360 Model 91 [Ande67]; Control Data Corporation's CDC 6600 [Thor70]. May 2010 Computer Arithmetic, Implementation Topics

661 Computer Arithmetic in the 1970s
Advent of microprocessors and vector supercomputers Early LSI chips were quite limited in the number of transistors or logic gates that they could accommodate Microprogrammed control (with just a hardware adder) was a natural choice for single-chip processors which were not yet expected to offer high performance For high end machines, pipelining methods were perfected to allow the throughput of arithmetic units to keep up with computational demand in vector supercomputers Examples reflecting the state of the art near the end of this decade: Cray 1 supercomputer and its successors May 2010 Computer Arithmetic, Implementation Topics

662 Computer Arithmetic in the 1980s
Spread of VLSI triggered a reconsideration of all arithmetic designs in light of interconnection cost and pin limitations For example, carry-lookahead adders, thought to be ill-suited to VLSI, were shown to be efficiently realizable after suitable modifications. Similar ideas were applied to more efficient VLSI tree and array multipliers Bit-serial and on-line arithmetic were advanced to deal with severe pin limitations in VLSI packages Arithmetic-intensive signal processing functions became driving forces for low-cost and/or high-performance embedded hardware: DSP chips May 2010 Computer Arithmetic, Implementation Topics

663 Computer Arithmetic in the 1990s
No breakthrough design concept. Demand for performance led to fine-tuning of arithmetic algorithms and implementations (many hybrid designs). Increasing use of table lookup and tight integration of the arithmetic unit and other parts of the processor for maximum performance. Clock speeds reached and surpassed 100, 200, 300, 400, and 500 MHz in rapid succession; pipelining was used to ensure smooth flow of data through the system. Examples reflecting the state of the art near the end of this decade: Intel's Pentium Pro (P6) and Pentium II; several high-end DSP chips. May 2010 Computer Arithmetic, Implementation Topics

664 Computer Arithmetic in the 2000s
Three parallel and interacting trends: Availability of many millions of transistors on a single microchip Energy requirements and heat dissipation of the said transistors Shift of focus from scientific computations to media processing Continued refinement of many existing methods, particularly those based on table lookup New challenges posed by multi-GHz clock rates Increased emphasis on low-power design Work on, and approval of, the IEEE floating-point standard May 2010 Computer Arithmetic, Implementation Topics

665 A.2 Early High-Performance Computers
IBM System 360 Model 91 (360/91, for short; mid 1960s). Part of a family of machines with the same instruction-set architecture. Had multiple function units and an elaborate scheduling and interlocking hardware algorithm to take advantage of them for high performance. Clock cycle = 20 ns (quite aggressive for its day). Used 2 concurrently operating floating-point execution units performing: two-stage pipelined addition; 12 × 56 pipelined partial-tree multiplication; division by repeated multiplications (initial versions of the machine sometimes yielded an incorrect LSB for the quotient). May 2010 Computer Arithmetic, Implementation Topics

666 Computer Arithmetic, Implementation Topics
The IBM System 360 Model 91 Fig. A.1 Overall structure of the IBM System/360 Model 91 floating-point execution unit. May 2010 Computer Arithmetic, Implementation Topics

667 A.3 Deeply Pipelined Vector Machines
Cray X-MP/Model 24 (multiple-processor vector machine) Had multiple function units, each of which could produce a new result on every clock tick, given suitably long vectors to process Clock cycle = 9.5 ns Used 5 integer/logic function units and 3 floating-point function units Integer/Logic units: add, shift, logical 1, logical 2, weight/parity Floating-point units: add (6 stages), multiply (7 stages), reciprocal approximation (14 stages) Pipeline setup and shutdown overheads Vector unit not efficient for short vectors (break-even point) Pipeline chaining May 2010 Computer Arithmetic, Implementation Topics

668 Cray X-MP Vector Computer
Fig. A.2 The vector section of one of the processors in the Cray X-MP/ Model 24 supercomputer. May 2010 Computer Arithmetic, Implementation Topics

669 Computer Arithmetic, Implementation Topics
A.4 The DSP Revolution. Special-purpose DSPs have used a wide variety of unconventional arithmetic methods; e.g., RNS or logarithmic number representation. General-purpose DSPs provide an instruction set that is tuned to the needs of arithmetic-intensive signal processing applications. Example DSP instructions: ADD A, B { A + B → B }; SUB X, A { A – X → A }; MPY X1, X0, B { X1 × X0 → B }; MAC Y1, X1, A { A + Y1 × X1 → A }; AND X1, A { A AND X1 → A }. General-purpose DSPs come in integer and floating-point varieties. May 2010 Computer Arithmetic, Implementation Topics

670 Fixed-Point DSP Example
Fig. A.3 Block diagram of the data ALU in Motorola's DSP56002 (fixed-point) processor.
May 2010 Computer Arithmetic, Implementation Topics

671 Floating-Point DSP Example
Fig. A.4 Block diagram of the data ALU in Motorola's DSP96002 (floating-point) processor.
May 2010 Computer Arithmetic, Implementation Topics

672 A.5 Supercomputers on Our Laps
In the beginning, there was the 8080; it led to the 80x86 = IA32 ISA
Half a dozen or so pipeline stages: 80286, 80386, 80486, Pentium (80586)
A dozen or so pipeline stages, with out-of-order instruction execution: Pentium Pro, Pentium II, Pentium III, Celeron
Two dozen or so pipeline stages: Pentium 4, in which instructions are broken into micro-ops that are executed out of order but retired in order
Each generation was enabled by more advanced implementation technology
May 2010 Computer Arithmetic, Implementation Topics

673 Performance Trends in Intel Microprocessors
May 2010 Computer Arithmetic, Implementation Topics

674 Arithmetic in the Intel Pentium Pro Microprocessor
Fig. A.5 Key parts of the CPU in the Intel Pentium Pro (P6) microprocessor.
May 2010 Computer Arithmetic, Implementation Topics

675 A.6 Trends, Outlook, and Resources
Current focus areas in computer arithmetic
Design: shift of attention from algorithms to optimizations at the level of transistors and wires; this explains the proliferation of hybrid designs
Technology: predominantly CMOS, with a phenomenal rate of improvement in size/speed; new technologies cannot yet compete
Applications: shift from high-speed or high-throughput designs in mainframes to embedded systems requiring low cost and low power
May 2010 Computer Arithmetic, Implementation Topics

676 Ongoing Debates and New Paradigms
Renewed interest in bit- and digit-serial arithmetic as mechanisms to reduce VLSI area and to improve packageability and testability
Synchronous vs. asynchronous design (asynchrony has some overhead, but an equivalent overhead is being paid for clock distribution and/or systolization)
New design paradigms may alter the way in which we view or design arithmetic circuits:
Neuronlike computational elements
Optical computing (redundant representations)
Multivalued logic (matched to high-radix arithmetic)
Configurable logic
Arithmetic complexity theory
May 2010 Computer Arithmetic, Implementation Topics

677 Computer Arithmetic Timeline
Key ideas, innovations, advancements, technology traits, and milestones, decade by decade:
1940s: binary format, carry chains, stored carry, carry-save multiplier, restoring divider
1950s: carry-lookahead adder, high-radix multiplier, SRT divider, CORDIC algorithms
1960s: tree/array multiplier, high-radix and convergence dividers, signed-digit, floating point
1970s: pipelined arithmetic, vector supercomputer, microprocessor, ARITH-2/3/4 symposia
1980s: VLSI, embedded system, digital signal processor, on-line arithmetic, IEEE 754-1985
1990s: CMOS dominance, circuit-level optimization, hybrid design, deep pipeline, table lookup
2000s: power/energy/heat reduction, media processing, FPGA-based arithmetic, IEEE 754-2008
2010s: teraflops on laptop (or pocket device?), asynchronous design, nanodevice arithmetic
Decade snapshots: [Burk46] [Shaw50] [MacS61] [Thor70] [Ande67] [Swar90] [Swar09] [Garn76]
Fig. A.6 Computer arithmetic through the decades.
May 2010 Computer Arithmetic, Implementation Topics

678 Computer Arithmetic, Implementation Topics
The End!
You're up to date. Take my advice and try to keep it that way. It'll be tough to do; make no mistake about it. The phone will ring and it'll be the administrator –– talking about budgets. The doctors will come in, and they'll want this bit of information and that. Then you'll get the salesman. Until at the end of the day you'll wonder what happened to it and what you've accomplished; what you've achieved. That's the way the next day can go, and the next, and the one after that. Until you find a year has slipped by, and another, and another. And then suddenly, one day, you'll find everything you knew is out of date. That's when it's too late to change. Listen to an old man who's been through it all, who made the mistake of falling behind. Don't let it happen to you! Lock yourself in a closet if you have to! Get away from the phone and the files and paper, and read and learn and listen and keep up to date. Then they can never touch you, never say, "He's finished, all washed up; he belongs to yesterday."
Arthur Hailey, The Final Diagnosis
May 2010 Computer Arithmetic, Implementation Topics

