1 (Part 3-Floating Point Arithmetic)
SERIAL PROCESSORS MACHINE LAYER (Part 3: Floating-Point Arithmetic). Yavuz Oruç, Sabatech Corporation & University of Maryland. © All rights reserved, 2005.

2 Floating-Point Numbers
Informally speaking, a floating-point number is a number that can be moved on a scale by adjusting the location of its radix point. Compared to fixed-point number systems, floating-point number systems permit computers to process a very large range of numbers within a common framework of arithmetic operations that approximates real arithmetic. One of the most commonly used floating-point representations is scientific notation. In this representation, a floating-point number x is written as a product of a number m, called the mantissa or significand, and an integer power of a number B: x = ±m × B^e, where B is an integer greater than or equal to 2, called the base, and the integer e is called the exponent.
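As a minimal Python sketch (illustrative, not from the slides), the following prints several equivalent (m, e) pairs for one value with base B = 10:

```
# Print several (mantissa, exponent) pairs that all denote the same value
# x = m * B**e, with base B = 10.
x = 23.4
B = 10
for e in range(-4, 3):
    m = x / B**e
    print(f"{x} = {m:g} * 10^{e}")
```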

3 Floating-Point Numbers (Cont’d)
Example: The following are all floating-point numbers: 23.4 × 10^0, (0.234)_8 × 8^2, (101.1)_2 × 2^-3. It is easy to see that there are many choices of m and e that represent the same floating-point number. For example, when the base B = 10, the first number above can be expressed as 0.234 × 10^2 = 234000 × 10^-4 = 23.4 × 10^0 = 2.34 × 10^1 = …, etc.

4 Floating-Point Numbers (Cont’d)
When one of these representations is selected such that the mantissa lies between 1 and the base, i.e., 1 ≤ m < B, the number is said to be normalized, and the corresponding representation is called a normalized floating-point number. Although a number has infinitely many floating-point representations, it has only one normalized floating-point representation. The normalized representations of the numbers in the example above are 2.34 × 10^1, (2.34)_8 × 8^1, and (1.011)_2 × 2^-1. Nearly all modern computers use the normalized scientific representation with base 2, i.e., the notation x = ±m × 2^e, where 1 ≤ m < 2.
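A small Python sketch of the normalization rule (the function name and interface are my own):

```
def normalize(x, B=2):
    """Return (sign, m, e) with x = sign * m * B**e and 1 <= m < B; x must be nonzero."""
    sign = 1 if x >= 0 else -1
    m, e = abs(x), 0
    while m >= B:      # radix point moves left, exponent grows
        m /= B
        e += 1
    while m < 1:       # radix point moves right, exponent shrinks
        m *= B
        e -= 1
    return sign, m, e

print(normalize(23.4, B=10))   # (1, 2.34, 1), up to float rounding
print(normalize(0.15625))      # (1, 1.25, -3) since 0.15625 = 1.25 * 2**-3
```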

5 Machine Representation of Floating-Point Numbers
Floating-point numbers are stored in a computer by their sign, mantissa, and exponent. The number of bits allocated to the mantissa and exponent, and the exact way these terms are stored, vary from one system to another. Here we describe one of the most commonly used representations, the one using a hidden bit and a biased exponent. The three terms in this representation are stored as

Sign | Exponent | Mantissa
  S  |    E     |    M

where S is a 1-bit sign, and E and M are k-bit and p-bit numbers that represent e and m, respectively. A floating-point number in this representation is stored as a sign-magnitude number: it is positive when S = 0 and negative when S = 1.
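As a sketch of this layout (the helper names pack and unpack are hypothetical, and the field widths k and p are parameters), the three fields can be combined into a single (1 + k + p)-bit word:

```
def pack(S, E, M, k, p):
    # S occupies the top bit, E the next k bits, M the low p bits.
    assert S in (0, 1) and 0 <= E < 2**k and 0 <= M < 2**p
    return (S << (k + p)) | (E << p) | M

def unpack(word, k, p):
    S = (word >> (k + p)) & 1
    E = (word >> p) & (2**k - 1)
    M = word & (2**p - 1)
    return S, E, M

word = pack(0, 0b01111000, 0b0100011110101110, k=8, p=16)
print(unpack(word, k=8, p=16))
```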

6 Machine Representation of Floating-Point Numbers (Cont’d)
The true exponent, e, is found by subtracting a fixed number from E, called the bias. For a k-bit exponent, the bias is 2^(k-1) − 1, and the true exponent and E are related by e = E − (2^(k-1) − 1). For a k-bit exponent, this mapping from the bit representation to a true exponent carries the domain of unsigned k-bit numbers {0, 1, …, 2^k − 1} onto the set of signed exponents {−2^(k-1) + 1, −2^(k-1) + 2, …, 0, 1, …, 2^(k-1)}. This translation from a true exponent to a biased exponent makes the lexicographic ordering of the binary tuple representations of floating-point numbers consistent with the ordering of their values.
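In code, the bias arithmetic is a one-liner each way; a sketch:

```
def bias(k):
    # Bias for a k-bit exponent field.
    return 2**(k - 1) - 1

def true_exponent(E, k):
    return E - bias(k)

def biased_exponent(e, k):
    return e + bias(k)

k = 8
print(bias(k))                     # 127
print(true_exponent(0, k))         # -127, the most negative true exponent
print(true_exponent(2**k - 1, k))  # 128, the most positive true exponent
print(biased_exponent(-7, k))      # 120
```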

7 Machine Representation of Floating-Point Numbers (Cont’d)
The mantissa of a floating-point number is stored in M except for its first digit. Since 1 ≤ m < 2 in binary representations, the digit to the left of the radix point in the mantissa is always 1. Therefore it does not have to be stored, although a 1 is always inserted as the first digit of the mantissa in floating-point operations. This first digit is called the hidden bit. If the p-bit number M is given by M = m_1 m_2 … m_p, then the mantissa with the hidden bit is given by m = (1.m_1 m_2 … m_p)_2 = 1 + m_1·2^-1 + m_2·2^-2 + … + m_p·2^-p.
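A sketch of the hidden-bit rule: since M stores only the fraction bits, the mantissa value is 1 + M/2^p.

```
def mantissa_value(M, p):
    # M holds the p fraction bits; the hidden leading 1 is re-inserted here.
    return 1 + M / 2**p

print(mantissa_value(0b0100011110101110, p=16))  # 1.27996826171875, i.e. ~1.28
```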

8 Machine Representation of Floating-Point Numbers (Cont’d)
Example: Consider the decimal floating-point number (0.01)_10. Converting the mantissa into a 16-bit binary mantissa, we have (0.01)_10 = (0.00000010100011110101110…)_2, and normalizing the binary number, we have (0.01)_10 = (1.0100011110101110)_2 × 2^-7. Hence, with p = 16 and an 8-bit biased exponent (bias 127, so E = −7 + 127 = 120), this number is represented internally in a computer as

S = 0,  E = 01111000,  M = 0100011110101110
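The following Python sketch reproduces this computation (assuming the decimal value 0.01 and the p = 16, k = 8 format used above):

```
x = 0.01
m, e = x, 0
while m < 1:              # normalize to 1 <= m < 2
    m *= 2
    e -= 1
E = e + 127               # 8-bit biased exponent: -7 + 127 = 120
M = int((m - 1) * 2**16)  # first 16 fraction bits, truncated
print(f"S=0  E={E:08b}  M={M:016b}")
# prints: S=0  E=01111000  M=0100011110101110
```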

9 Machine Representation of Floating-Point Numbers (Cont’d)
[Table: floating-point numbers in the normalized representation with p = k = 2.]
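Since the table itself did not survive in this transcript, here is a sketch that enumerates the same system (p = k = 2, bias = 1, ignoring the special interpretations of E = 0 introduced later):

```
values = set()
for S in (0, 1):
    for E in range(4):          # 2-bit biased exponent, bias = 1
        for M in range(4):      # 2-bit mantissa field, hidden bit = 1
            values.add((-1)**S * (1 + M / 4) * 2**(E - 1))
print(sorted(v for v in values if v > 0))
# [0.5, 0.625, 0.75, 0.875, 1.0, 1.25, 1.5, 1.75,
#  2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0, 7.0]
```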

10 Machine Representation of Floating-Point Numbers (Cont’d)
Remark: One problem with the normalized floating-point number system is that there is no way to normalize zero, since all of its bits are identically 0. Instead, we adopt the convention that both of the (1 + k + p)-bit representations below represent 0:

S = 0, E = 00…0, M = 00…0   and   S = 1, E = 00…0, M = 00…0

More generally, floating-point numbers with a zero exponent field are called de-normalized floating-point numbers, and their values are computed by setting the hidden bit to 0.
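A decoding sketch under this convention (note: IEEE-754 additionally uses e = 1 − bias for de-normalized numbers; the slide does not specify that adjustment, so plain e = E − bias is assumed here):

```
def decode(S, E, M, k, p):
    bias = 2**(k - 1) - 1
    hidden = 0 if E == 0 else 1      # zero exponent field: hidden bit is 0
    return (-1)**S * (hidden + M / 2**p) * 2.0**(E - bias)

print(decode(0, 0, 0, k=8, p=16))   # 0.0
print(decode(1, 0, 0, k=8, p=16))   # -0.0
```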

11 Precision and Mantissa Range of a Floating-Point Number System
When we fix the values of p and k, the number of floating-point numbers that can be represented is also fixed: a p-bit normalized mantissa with a 1-bit sign and a k-bit biased exponent gives us 2^(p+k+1) floating-point numbers in all. The smallest differential mantissa, i.e., 2^-p, in a floating-point representation is called the precision of the representation. Two distinct mantissas in a floating-point number system with a p-bit mantissa cannot be closer than its precision, so two distinct numbers sharing an exponent e cannot be closer than 2^(e-p). The normalization of floating-point numbers further constrains the representation by introducing a gap between 0 and the rest of the numbers. The mantissa range of a floating-point number system is the interval of mantissas that can be represented in that system.
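A quick numerical sketch of these counts and gaps (for an illustrative p = 16, k = 8 format):

```
p, k = 16, 8
print(2**(p + k + 1))   # 33554432 bit patterns in all
print(2.0**-p)          # 1.52587890625e-05, the precision
# Adjacent mantissas at exponent e differ by 2**(e - p):
print((1 + 2**-p) * 2.0**3 - 1 * 2.0**3)   # 0.0001220703125 = 2**(3 - p)
```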

12 Machine Representation of Floating-Point Numbers (Cont’d)
This contrast between a normalized and an un-normalized representation is illustrated in the table below. The shaded area in the normalized scale shows that there is a set of floating-point numbers in the neighborhood of 0 which are included in the un-normalized floating-point number system but cannot be represented in the normalized floating-point number system. [Table omitted in this transcript.]
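A sketch of the gap: the smallest positive normalized and de-normalized values for an illustrative k = 8, p = 16 format under the conventions above:

```
k, p = 8, 16
bias = 2**(k - 1) - 1
smallest_normalized   = (1 + 0 / 2**p) * 2.0**(1 - bias)  # E = 1, M = 0
smallest_denormalized = (0 + 1 / 2**p) * 2.0**(0 - bias)  # E = 0, M = 1
print(smallest_normalized, smallest_denormalized)
# 1.1754943508222875e-38  8.968310171678829e-44
```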

13 Single and Double Precision Floating-Point Numbers
By convention, a floating-point number is called a single precision or a double precision number based on either (1) the number of bits it uses or (2) the number of registers it occupies. We adopt the latter convention for classifying a floating-point number as single or double precision. Thus, in single precision, each register in the register file of a processor represents a floating-point number by itself. If the processor supports 8-, 16-, and 32-bit registers, it is possible to define 8-bit, 16-bit, and 32-bit floating-point numbers in single precision. In double precision, pairs of registers, typically with adjacent indices and starting with R0, represent a single floating-point number. For example, in a register file with four registers R0, R1, R2, R3, each of the pairs (R0,R1), (R1,R2), (R2,R3), and (R3,R0) represents a floating-point number. The higher-order bits of the number are stored in the first register of the pair, and the lower-order bits in the second. Using this convention, it is thus possible to deal with 16-bit, 32-bit, and 64-bit floating-point numbers in double precision.
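A sketch of the register pairing (hypothetical bit pattern; high-order half first):

```
word = 0x3FF0000000000000        # a 64-bit double precision pattern
R0 = (word >> 32) & 0xFFFFFFFF   # higher-order bits, first register of pair
R1 = word & 0xFFFFFFFF           # lower-order bits, second register of pair
print(hex(R0), hex(R1))          # 0x3ff00000 0x0
```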

14 Single and Double Precision Floating-Point Numbers (Cont’d)
For a given number of bits, the allocation of bits between the mantissa and exponent sections of a floating-point representation is kept the same whether one or two registers are used. However, in single precision all bits are stored in a single register, while in double precision they are divided between two registers. For example, the single precision representation with a single 32-bit register and the double precision representation with two 16-bit registers have the same bit allocation.

15 Single and Double Precision Floating-Point Numbers (Cont’d)
The bit allocations for the 32-bit and 64-bit cases correspond to the IEEE-754 single and double precision floating-point representations. In IEEE-754 floating-point number representations, the number of bits in the representation is what classifies a floating-point number as single or double precision.

Representation size | Sign | Exponent | Mantissa
8                   | 1    | 2        | 5
16                  | 1    | 4        | 11
32                  | 1    | 8        | 23
64                  | 1    | 11       | 52

Typical floating-point representations.
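Derived properties of these formats, as a sketch:

```
for size, k, p in [(8, 2, 5), (16, 4, 11), (32, 8, 23), (64, 11, 52)]:
    bias = 2**(k - 1) - 1
    print(f"{size:2d}-bit: k={k:2d}, p={p:2d}, bias={bias:4d}, precision=2^-{p}")
```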

16 Single and Double Precision Floating-Point Numbers (Cont’d)
The most positive and most negative finite representations in the IEEE-754 single precision floating-point format (32-bit numbers) are S = 0, E = 11111110, M = 111…1 and S = 1, E = 11111110, M = 111…1, i.e., ±(2 − 2^-23) × 2^127 ≈ ±3.40 × 10^38.
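A check of these extremes (the struct call reinterprets the bit pattern 0x7F7FFFFF as a 32-bit float):

```
import struct

most_positive = (2 - 2**-23) * 2.0**127
print(most_positive)   # 3.4028234663852886e+38
print(struct.unpack('>f', (0x7F7FFFFF).to_bytes(4, 'big'))[0])  # same value
```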

17 Representation of Infinity and Other Exceptional Numbers
The above formulas can be used to find the representation of most numbers. However, some representations are reserved for special cases, such as zero, infinity, and not-a-number (NaN). The last of these denotes the outcome of undefined operations such as 0/0 and 0 × ∞. We already saw that the normalized floating-point formulas above cannot represent zero, because of the hidden bit, without a special interpretation of the numbers with the smallest exponent field. The other special cases are infinity and NaN. These cases are represented by setting the exponent field to its maximum (all 1's). Therefore, when E = 2^k − 1, or equivalently e = 2^(k-1), the representation corresponds to either infinity or NaN. We use the mantissa to distinguish between the two: if M = 0 and E = 2^k − 1, the representation is infinity; if M ≠ 0 and E = 2^k − 1, the representation is NaN.
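A classification sketch following these rules:

```
def classify(S, E, M, k):
    if E == 2**k - 1:                    # exponent field all 1's
        if M == 0:
            return '-infinity' if S else '+infinity'
        return 'NaN'
    return 'ordinary number'

print(classify(0, 0b11111111, 0, k=8))        # +infinity
print(classify(1, 0b11111111, 0, k=8))        # -infinity
print(classify(0, 0b11111111, 1 << 22, k=8))  # NaN
```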

18 Representation of Infinity and Other Exceptional Numbers (Cont’d)
In all of the special cases, the sign bit is used to distinguish between the positive and negative versions of these numbers, i.e., +0, −0, +∞, −∞, +NaN, −NaN. The NaNs are further divided into quiet NaNs (QNaNs) and signaling NaNs (SNaNs). A QNaN is designated by setting the most significant bit of the mantissa field, and an SNaN is specified by clearing that same bit (while keeping M ≠ 0). QNaNs can be viewed as NaNs that can be tolerated during the course of a floating-point computation, whereas SNaNs force the processor to signal an invalid operation, as in the case of division of 0 by 0.
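The quiet/signaling distinction tests the top fraction bit; a sketch:

```
def nan_kind(M, p):
    # M != 0 is assumed (otherwise the pattern is an infinity).
    return 'QNaN' if (M >> (p - 1)) & 1 else 'SNaN'

print(nan_kind(0b10000000000000000000000, p=23))  # QNaN
print(nan_kind(0b00000000000000000000001, p=23))  # SNaN
```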

19 Representation of Infinity and Other Exceptional Numbers (Cont’d)
Example: For k = 8, the first two bit patterns below represent +∞ and −∞, respectively, and the last two represent NaNs:

S = 0, E = 11111111, M = 000…0   (+∞)
S = 1, E = 11111111, M = 000…0   (−∞)
S = 0, E = 11111111, M = 100…0   (QNaN)
S = 0, E = 11111111, M = 000…01  (SNaN)

20 Mantissas In 2’s Complement Format
Most processors use a sign-magnitude representation for the mantissas of floating-point numbers. Alternatively, one can use 1's or 2's complement notation, as in fixed-point numbers, to represent signed mantissas; this makes the subtraction of mantissas easier to handle. In an un-normalized 2's complement mantissa representation, the sign of the mantissa is specified by its leading bit: if that bit is 0, the mantissa is positive, and if it is 1, the mantissa is negative.

21 Mantissas In 2’s Complement Format (Cont’d)
Determining the value of a floating-point number with a 2's complement mantissa is only slightly more complex. If the leading bit of the mantissa is 0, the value of the number is positive, and it is the same as if its mantissa were expressed in sign-magnitude notation. When the leading bit is 1, the number is negative, and its value is determined by complementing the mantissa's bits and adding 2^-p, where p is the number of fraction bits in the mantissa part of the number. Example: Consider the mantissa (1.011)_2 in 2's complement notation, with p = 3. Complementing its bits gives (0.100)_2, and adding 2^-3 gives (0.101)_2, so the value of this number is −(0.101)_2 = −(0.625)_10.
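A sketch of this rule using exact fractions (the bit-string interface is my own):

```
from fractions import Fraction

def twos_complement_value(bits, p):
    """bits is a string such as '1.011': one integer bit, p fraction bits."""
    digits = bits.replace('.', '')
    n = int(digits, 2)            # read as an unsigned (p+1)-bit integer
    if digits[0] == '1':          # leading bit 1: negative, wrap by 2**(p+1)
        n -= 2**len(digits)
    return Fraction(n, 2**p)

print(twos_complement_value('1.011', 3))   # -5/8, i.e. -0.625
```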

22 Homework Set 5
Problem 1. For each of the numbers below, give its normalized floating-point form in the base in which it is expressed.
(a) (345.21)_8  (b) (…)_2  (c) (…)_10  (d) (…)_4 × 4^6
Problem 2. For each of the decimal numbers below, show its normalized and un-normalized representations in 8-bit biased exponent and 24-bit sign-magnitude mantissa format.
(a) …  (b) …  (c) …  (d) …
Problem 3. How many floating-point numbers can be written with a 2-bit biased exponent and a 3-bit normalized sign-magnitude mantissa, assuming (a) the numbers are not de-normalized, (b) the numbers are de-normalized for the most negative and most positive exponents? Specify the most positive and most negative numbers, and the precision of the number system, in each case.
Problem 4. Specify the decimal value of each of the floating-point machine numbers below, assuming that the exponent is 8 bits and biased and the mantissa is 23 bits and in 2's complement format.
(a) …  (b) …
Problem 5. Develop an algorithm to compare two floating-point numbers (E1, M1) and (E2, M2), where E1 and E2 are k-bit biased exponents and M1 and M2 are (p+1)-bit 2's complement mantissas.

Sign | Biased exponent | 2's complement mantissa

