Data Representation Overflow Limits

There are no good explanations of this topic in the books. Try to understand it in class and from the slides.

Representation of Data
All data in the computer's memory and files is represented as a sequence of bits.
Bit: the smallest unit of storage; it represents the level of an electrical charge and can be either 0 or 1.
Byte: another unit of storage, made up of 8 bits.
A bit sequence can represent many different things: we will see that the same bit string can mean several different things depending on the representation that is agreed upon.
So, how should we represent integers, characters, real numbers, strings and structures in terms of bits?
Representations must be efficient and convenient. We will see some of them.

Characters
In C/C++, characters are actually one-byte integers with a special meaning as characters, given by the ASCII mapping.
    char c = 'a';   // stores the code corresponding to the letter 'a',
                    // but prints the character a when printed
ASCII Standard
American Standard Code for Information Interchange; it dates back to the 1960s.
Standard ASCII defines 128 codes (0 – 127); together with the extended range, one byte gives 256 different codes (0 – 255) and corresponding characters.
The characters with codes 0 – 31 and 127 are control characters. At that time, standardizing communication-related and telegraphic codes was important. That is why most of the control characters serve this purpose and are now obsolete, although some OSs still implement some of them.
Extended ASCII (128 – 255): there are different conventions and interpretations.

ASCII
These are the ASCII codes. Of course you are not expected to memorize them; just know that there are codes for special characters and that the digits and the letters of one case are consecutive (so one can do '1'+5 to get the code of '6', or subtract 32 from a lowercase letter to get the corresponding uppercase letter). On the slide, the control characters that are important for Windows/DOS are shown in blue.
|  0 NUL|  1 SOH|  2 STX|  3 ETX|  4 EOT|  5 ENQ|  6 ACK|  7 BEL|
|  8 BS |  9 HT | 10 LF | 11 VT | 12 FF | 13 CR | 14 SO | 15 SI |
| 16 DLE| 17 DC1| 18 DC2| 19 DC3| 20 DC4| 21 NAK| 22 SYN| 23 ETB|
| 24 CAN| 25 EM | 26 SUB| 27 ESC| 28 FS | 29 GS | 30 RS | 31 US |
| 32 SP | 33 !  | 34 "  | 35 #  | 36 $  | 37 %  | 38 &  | 39 '  |
| 40 (  | 41 )  | 42 *  | 43 +  | 44 ,  | 45 -  | 46 .  | 47 /  |
| 48 0  | 49 1  | 50 2  | 51 3  | 52 4  | 53 5  | 54 6  | 55 7  |
| 56 8  | 57 9  | 58 :  | 59 ;  | 60 <  | 61 =  | 62 >  | 63 ?  |
| 64 @  | 65 A  | 66 B  | 67 C  | 68 D  | 69 E  | 70 F  | 71 G  |
| 72 H  | 73 I  | 74 J  | 75 K  | 76 L  | 77 M  | 78 N  | 79 O  |
| 80 P  | 81 Q  | 82 R  | 83 S  | 84 T  | 85 U  | 86 V  | 87 W  |
| 88 X  | 89 Y  | 90 Z  | 91 [  | 92 \  | 93 ]  | 94 ^  | 95 _  |
| 96 `  | 97 a  | 98 b  | 99 c  |100 d  |101 e  |102 f  |103 g  |
|104 h  |105 i  |106 j  |107 k  |108 l  |109 m  |110 n  |111 o  |
|112 p  |113 q  |114 r  |115 s  |116 t  |117 u  |118 v  |119 w  |
|120 x  |121 y  |122 z  |123 {  |124 |  |125 }  |126 ~  |127 DEL|
Codes 0 – 31 and 127 are special control characters. Most of them are now obsolete. Some OSs implement some of them, and their meanings may change from OS to OS.
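The two tricks mentioned above ('1'+5 and subtracting 32) can be sketched as follows. This is only an illustration: the function names digitAfter and toUpperAscii are made up for this example, and the code assumes an ASCII-compatible character set.

```cpp
#include <cassert>

// Returns the digit character that is `offset` places after `digit`.
// Hypothetical helper; assumes the result is still within '0'..'9'.
char digitAfter(char digit, int offset) {
    assert(digit >= '0' && digit <= '9');
    return static_cast<char>(digit + offset);   // digits have consecutive codes
}

// Converts a lowercase ASCII letter to uppercase by subtracting 32
// ('a' is 97 and 'A' is 65); any other character is returned unchanged.
char toUpperAscii(char c) {
    if (c >= 'a' && c <= 'z') {
        return static_cast<char>(c - 32);
    }
    return c;
}
```

For instance, digitAfter('1', 5) gives '6' and toUpperAscii('q') gives 'Q'. In real programs the standard function std::toupper from <cctype> is usually the better choice.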

Integer Number Representation
Sign-Magnitude Representation
1's complement Representation
2's complement Representation
Comparison of different representations

Number Representation
Fundamental problem: a fixed-size representation (e.g. 4 bytes for integers) can't encode all numbers.
It is usually sufficient in most applications, but it is a potential source of bugs (overflow), so we need to be careful about it.
Other problems:
How to represent negative numbers and floating-point numbers? Historically, there have been many different representations.
How to do subtraction effectively?

Base 2 – unsigned numbers
MSB – Most Significant Bit, LSB – Least Significant Bit
8-bit binary representation of non-negative integers:
    00000000    0
    00000001    1
    00000010    2
    00000011    3
    ...
    11111111    255
Representation: an n-bit number in base b has decimal value $\sum_{i=0}^{n-1} d_i \, b^i$, where d_i is the coefficient of the ith bit. Bit 0 is the LSB and bit n-1 is the MSB.
Example for base 2 (binary): 1011_2 = 1 x 2^0 + 1 x 2^1 + 0 x 2^2 + 1 x 2^3 = 11_10
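The positional formula above can be evaluated from left to right by repeatedly doubling and adding the next bit. A minimal sketch (the function name binaryToUnsigned is an assumption made for this example, and the input is assumed to contain only the characters '0' and '1'):

```cpp
#include <string>

// Computes the value of an unsigned bit string, e.g. "1011" -> 11.
// Processing the MSB first, each step multiplies the running value by the
// base (2) and adds the next digit, which is equivalent to summing d_i * 2^i.
unsigned long binaryToUnsigned(const std::string& bits) {
    unsigned long value = 0;
    for (char bit : bits) {
        value = value * 2 + (bit == '1' ? 1UL : 0UL);
    }
    return value;
}
```

With this sketch, binaryToUnsigned("1011") is 11 and binaryToUnsigned("11111111") is 255, matching the 8-bit table above.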

Sign/Magnitude representation (also called “signed representation”)
Use one of the bits (the first bit = Most Significant Bit) as a sign bit and use the rest for the magnitude, e.g.
    000 = +0
    001 = +1
    010 = +2    positive numbers
    011 = +3
    100 = -0
    101 = -1
    110 = -2    negative numbers
    111 = -3
Range: -(2^(n-1) - 1) to 2^(n-1) - 1, where n is the total number of bits.
For n = 4: [-(2^3 - 1), 2^3 - 1] = [-7, 7]

Alternative representations
Most computers don't use a "sign and magnitude" representation.
Drawbacks of the sign-magnitude representation:
there are two 0s, one positive and one negative;
addition and subtraction involving negative numbers are complicated.
Alternatives?
1's complement representation
2's complement representation: today's standard
These two representations seem very similar in approach, but they differ in the representation of negative numbers (positives are the same in all 3 representations) and in the ease of arithmetic operations involving negative numbers.

Signed numbers: 1’s complement
Positive numbers: the first bit is 0, and the rest is the binary equivalent of the number.
Negative numbers: represented by the 1's complement of the corresponding positive number.
1's complement: invert all the bits (0s become 1s; 1s become 0s).
e.g. +8 = 01000 (0 for the + sign, and 1000 for 8), -8 = 10111
So, effectively the first bit is used for the sign, but negative numbers differ from those of the sign-magnitude representation.
How about 0?

(Table on the slide: numbers from -7 to +7 and their 4-bit 1's complement bit strings.)
As in the sign-magnitude representation, there is a +0 and a -0.
Range: [-(2^(n-1) - 1), 2^(n-1) - 1]
For n = 4: [-(2^3 - 1), 2^3 - 1] = [-7, 7]

Signed numbers: 2’s complement
Signed 2's complement is the common representation for signed numbers in computers and the current standard for representing signed integers.
For positive numbers, the first bit is 0 and the remaining bits are the binary equivalent of the magnitude.
Negative numbers are represented by the 2's complement of the corresponding positive number.
2's complement: invert all bits and add 1.
Alternative (easier) method: going from right to left, copy all the bits up to and including the first 1, then invert the rest.
Ex: 20 = 00010100, so -20 = 11101100
There is a single 0, and the complexities of addition and subtraction are simplified.
Note the range (one more negative number than in 1's complement): [-2^(n-1), 2^(n-1) - 1]
For n = 4: [-2^3, 2^3 - 1] = [-8, 7]

(Table on the slide: numbers and their 4-bit 2's complement bit strings.)
There is only one zero.
There is one more negative number than there are positive numbers.

Possible Representations: summary
    Bits   Sign Magnitude   One's Complement   Two's Complement
    000         +0                +0                 +0
    001         +1                +1                 +1
    010         +2                +2                 +2
    011         +3                +3                 +3
    100         -0                -3                 -4
    101         -1                -2                 -3
    110         -2                -1                 -2
    111         -3                -0                 -1
Notice: positive numbers are represented the same way (same bit strings) in all representations! So for all three representations, handling a positive number is a direct decimal-to-binary / binary-to-decimal conversion.

Decimal Conversion for Negatives
If you are given a bit string representing a negative number, you can find the decimal equivalent depending on the number representation used. The examples here use the 8-bit string 10010010.
If sign/magnitude representation is used:
If the MSB is 1, the number is negative, but this bit makes no contribution to the magnitude. Convert the remaining bits to decimal for the magnitude.
For example, 10010010 is equivalent to -18: -(1x16 + 1x2)
If 1's complement representation is used:
If the MSB is 1, the number is negative. To find the magnitude, invert all bits (i.e. negate): 10010010 => 01101101
Find the positive number corresponding to the negated string: 1x64 + 1x32 + 1x8 + 1x4 + 1x1 = 109
So 10010010 is equivalent to -109.
Note that this is the reverse of what we would do if we wanted to find the bit representation of -109 (find the bit representation of 109, then take the 1's complement).

Decimal Conversion for Negatives– ctd.
If 2's complement representation is used:
If the MSB is 1, the number is negative. To find the magnitude:
invert all bits: 10010010 => 01101101
add 1: 01101101 => 01101110; this is the negated value
find the positive number corresponding to the negated string (01101110): 1x64 + 1x32 + 1x8 + 1x4 + 1x2 = 110
So 10010010 is equivalent to -110.
Note that this is the reverse of what we would do if we wanted to find the bit representation of -110 (find the bit representation of 110, then take the 2's complement).

Alternative decimal conversion – 2s comp.
You can also directly/quickly find the decimal equivalent of a 2's complement number: use the usual binary-to-decimal conversion, but give the most significant bit a negative coefficient.
Weights of the 8 bits, from MSB to LSB: -2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0
Hence: 10010010 = -1x2^7 + 1x2^4 + 1x2^1 = -128 + 16 + 2 = -110
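To compare the three decoding rules side by side, here is a small sketch that interprets the same 8-bit string, 10010010, under each representation. The function names are invented for this illustration, and the input is assumed to be an 8-character string of '0' and '1'.

```cpp
#include <cassert>
#include <string>

// Plain unsigned value of a bit string (MSB first).
int magnitudeOf(const std::string& bits) {
    int value = 0;
    for (char bit : bits) {
        value = value * 2 + (bit == '1' ? 1 : 0);
    }
    return value;
}

// Sign/magnitude: the MSB is only a sign; the other bits are the magnitude.
int fromSignMagnitude(const std::string& bits) {
    assert(bits.size() == 8);
    int magnitude = magnitudeOf(bits.substr(1));
    return bits[0] == '1' ? -magnitude : magnitude;
}

// 1's complement: for a negative number, invert all bits to get the magnitude.
int fromOnesComplement(const std::string& bits) {
    assert(bits.size() == 8);
    if (bits[0] == '0') return magnitudeOf(bits);
    std::string inverted = bits;
    for (char& b : inverted) b = (b == '0') ? '1' : '0';
    return -magnitudeOf(inverted);
}

// 2's complement: usual conversion, but the MSB has weight -2^(n-1).
int fromTwosComplement(const std::string& bits) {
    assert(bits.size() == 8);
    int value = magnitudeOf(bits.substr(1));
    if (bits[0] == '1') value -= 128;   // -2^7 for 8 bits
    return value;
}
```

For the string "10010010" these return -18, -109 and -110 respectively, the same values obtained by hand above.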

conversion to decimal with 32 bit numbers – 2s comp.
Same idea as for 8-bit 2's complement integers, but the most significant bit has weight -2^31 = -2,147,483,648.
Weights from MSB to LSB: -2^31, 2^30, ..., 2^7, 2^6, 2^5, 2^4, 2^3, 2^2, 2^1, 2^0

A very important note
Converting an n-bit number into a number with more than n bits: copy the most significant bit (the sign bit) into the additional bits.
Example: 4-bit to 8-bit
    0010 -> 00000010 (both have decimal value 2)
    1010 -> 11111010 (both have decimal value -6 in 2's complement)
This method is valid for both 1's and 2's complement representations.
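The sign-extension rule can be sketched in code for the 4-bit to 8-bit case. The helper name signExtend4To8 is hypothetical; the 4-bit value is assumed to sit in the low four bits of a byte.

```cpp
#include <cstdint>

// Extends a 4-bit value (stored in the low nibble) to 8 bits by copying
// its sign bit (bit 3) into bits 4..7.
std::uint8_t signExtend4To8(std::uint8_t nibble) {
    nibble &= 0x0F;                                       // keep only the 4-bit value
    if (nibble & 0x08) {                                  // sign bit set: negative
        return static_cast<std::uint8_t>(nibble | 0xF0);  // fill upper bits with 1s
    }
    return nibble;                                        // upper bits stay 0
}
```

Here signExtend4To8(0x2) yields 0x02 (0010 becomes 00000010) and signExtend4To8(0xA) yields 0xFA (1010 becomes 11111010). The compiler performs the same kind of extension automatically when, for example, a short is assigned to an int.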

Subtraction
a - b can always be written as a + (-b). Doing the arithmetic this way with plain binary addition gives wrong results in the sign-magnitude and 1's complement representations, but not in 2's complement. We will see an example now.
Consider 3 - 2, which is the same as 3 + (-2).
In sign-magnitude representation using 4 bits, 3 + (-2) should give 1, but instead we get -5!
      0011 = +3
    + 1010 = -2
      1101 = -5, which is a wrong result
To remedy this, the operation can take special notice of the sign bits and perform a subtraction instead. This complicates the implementation; we have a better solution using 2's complement (next slide).

Subtraction In 1's complement representation using 4 bits:
3 + (-2) becomes 0!
      0011 = +3
    + 1101 = -2
    1 0000 = 0, which is a wrong result (the carry out of the MSB does not fit in 4 bits and is dropped)
But two's complement addition results in the correct sum without hassle.
      0011 = +3
    + 1110 = -2
    1 0001 = +1, which is the correct result!
We got rid of the carry out of the MSB automatically since it does not fit in 4 bits.

Why 2's Complement?
There is only one zero.
The range for negative numbers is one more than in the other representations.
Subtraction can be implemented as addition (a - b = a + (-b)). Thus no borrowing logic needs to be implemented.
An 8-bit example: -51 - 70 = -51 + (-70) = ?
      11001101    -51
    + 10111010    -70
    1 10000111    -121
Due to the fixed width of the registers, the carry out of the MSB is lost automatically.
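The idea that subtraction is just addition of the negated operand can be tried out on raw 8-bit patterns. This is a sketch under the assumption that the operands are given as two's complement bit patterns in a std::uint8_t; the name subtractViaAdd is made up for the example.

```cpp
#include <cstdint>

// Computes a - b on 8-bit patterns as a + (2's complement of b).
// The cast back to 8 bits drops any carry out of the MSB, just as a
// fixed-width register would.
std::uint8_t subtractViaAdd(std::uint8_t a, std::uint8_t b) {
    std::uint8_t negB = static_cast<std::uint8_t>(~b + 1u);   // invert and add 1
    return static_cast<std::uint8_t>(a + negB);               // carry is discarded
}
```

With a = 0xCD (the pattern 11001101, i.e. -51) and b = 70, the result is 0x87 (10000111), which is -121 in 2's complement. Likewise subtractViaAdd(3, 2) gives 1.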

Two's Complement – Negation
Negating a two's complement number is simple:
Start at the least significant bit. Copy through the first "1"; after that, invert each bit.
Alternatively, invert all bits and add one to the least significant bit.
If you negate twice, you arrive back at the original number.
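Both negation methods can be written down in a few lines and checked against each other. The sketch below works on 8-bit patterns; the function names are assumptions made for this illustration.

```cpp
#include <cstdint>

// Method 1: invert all bits and add one (modulo 2^8).
std::uint8_t negateInvertAddOne(std::uint8_t bits) {
    return static_cast<std::uint8_t>(~bits + 1u);
}

// Method 2: scanning from the LSB, copy bits through the first 1,
// then invert every bit to the left of it.
std::uint8_t negateCopyThenInvert(std::uint8_t bits) {
    std::uint8_t result = 0;
    bool seenOne = false;
    for (int i = 0; i < 8; ++i) {
        unsigned bit = (bits >> i) & 1u;
        unsigned out = seenOne ? (bit ^ 1u) : bit;   // invert only after the first 1
        result = static_cast<std::uint8_t>(result | (out << i));
        if (bit == 1u) seenOne = true;
    }
    return result;
}
```

For 00010100 (20) both functions return 11101100 (-20), and applying either one twice returns the original pattern.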

Important Note on Terminology!
"2's complement" (or "two’s complement") does not mean a negative number! 2's complement is a representation used to represent all integers, not just negative integers! So 2's complement is a format specification, but we also use the term "2's complement of a number" as its negation e.g. when we want to negate a number, either from positive to negative or negative to positive, we may say "take its 2's complement".

Overview of Built-in Types and Their Ranges

Built-in Types in C++
These are the types that are part of the C++ language itself, not implemented as classes:
char
int, long (a.k.a. long int), short (a.k.a. short int)
float, double
They are mostly for numeric data representation.
There are signed (the default) and unsigned versions for integer/char storage.
Signed integer representation uses 2's complement.
Now we are going to see some characteristics of these types and their limits. Some of these discussions are not new to you (they were discussed in CS201), but after learning data representation and 2's complement, they will mean more to you now.

char
The type char is known for storing an ASCII character, but it actually stores a signed one-byte integer (2's complement representation).
Since there is no other one-byte integer type in C++, char is widely used as an integer as well. Of course, it is also used to store a character (as seen at the beginning of these slides).
    char ch;
    ch = 'A';    // valid
    ch = 99;     // valid
The type char is "signed" by default in Visual Studio. The range is -128 ... 127.
    ch = -25;    // valid
    ch = 135;    // out of range, but not a syntax error; the compiler gives a warning
135 is out of range but still fits in 8 bits (10000111). When you have
    cout << ch;
the output is the character whose code is 135. However, when you have
    cout << (int) ch;
the output is the signed (2's complement) interpretation of the bit string 10000111, extended to 32 bits, which is -121. Thus you see -121 as the output.
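The value that appears when an out-of-range code is read back as a signed char can be computed by hand: keep the low 8 bits and subtract 256 if the MSB is set. A minimal sketch of that rule (the function name reinterpretAsSigned8 is hypothetical, and the arithmetic is written out explicitly instead of relying on a cast):

```cpp
// Interprets the low 8 bits of `pattern` as an 8-bit 2's complement number.
// If the MSB (value 128) is set, the number is negative: subtract 2^8.
int reinterpretAsSigned8(unsigned int pattern) {
    pattern &= 0xFFu;                              // keep only 8 bits
    if (pattern >= 128u) {
        return static_cast<int>(pattern) - 256;    // MSB has weight -128
    }
    return static_cast<int>(pattern);
}
```

reinterpretAsSigned8(135) is -121, the value discussed above, while codes in the range 0 to 127 such as 99 are returned unchanged.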

unsigned char
You can change the default behaviour to unsigned by changing the project properties: open the project's Property Pages dialog box, click on "C/C++", click on "Command Line", and add the /J compiler option.
You can also explicitly declare a char variable as unsigned by putting the keyword unsigned before char.
It is for non-negative one-byte integers. Since there are no negatives, there is no need to use 2's complement. The range is 0 ... 255.
For the ASCII interpretation, signed and unsigned do not make a difference: the character is the one corresponding to the binary representation.
    unsigned char ch;
    ch = 200;    // valid; the character with code 200
    ch = -25;    // out of range, but not a syntax error; the compiler gives a warning
-25 is out of range, but it can be represented in 2's complement as 11100111 in binary, and the unsigned interpretation of this bit string is 231. Thus
    cout << ch;
displays the character whose code is 231.

int, short, long, long long
The "signed" integer types of C++ use 2's complement representation.
int
The most commonly used signed integer type of C++.
Typically the number of bytes used is tied to the word size of the processor, so on 32-bit computers it is 4 bytes, and on 64-bit computers one might expect 8 bytes. However, Visual Studio fixes it to 4 bytes; thus, in CS204 we can assume that int always uses 4 bytes. But if you port your code to another platform using another compiler, do not trust that int uses 4 bytes.
Range: INT_MIN to INT_MAX (these are defined in the limits.h or climits header file), i.e. -2^(n-1) to 2^(n-1) - 1, where n is the number of bits used
32 bits (our case): -2,147,483,648 to +2,147,483,647
64 bits: -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807
long (can also be written as long int)
    long num;    // can also be defined as long int num;
A signed integer that always uses 4 bytes in Visual Studio (on other platforms it may use 8 bytes). With 4 bytes, the range is the same as that of a 32-bit int.

int, short, long, long long
long long (can also be written as long long int)
    long long wow;    // can also be defined as long long int wow;
A signed integer of at least 64 bits (always 64 bits in Visual Studio). It has been part of standard C++ since C++11, so it is portable to any C++11 compiler; only very old compilers may lack it.
Range: LLONG_MIN to LLONG_MAX (-9,223,372,036,854,775,808 to +9,223,372,036,854,775,807)
short (can also be written as short int)
    short count;    // can also be defined as short int count;
A signed integer that always uses 2 bytes in Visual Studio.
Range: SHRT_MIN to SHRT_MAX (these are defined in the limits.h or climits header file): -32,768 to +32,767
    count = 31500;    // valid
    count = 35000;    // out of range, but not a syntax error; the compiler gives a warning
So, what is the output of cout << count; ? It displays -30536. Why? Write 35000 in binary in 16 bits and interpret this bit string as a 2's complement signed number:
35000 = 1000100010111000 = -32768 + 2232 = -30536

unsigned integers
In order to store only non-negative values, char, int, short, long and long long can be defined as unsigned by putting the unsigned keyword before the type name.
    unsigned int mynum;
    unsigned short cinekop;      // same as unsigned short int cinekop;
    unsigned long lufer;         // same as unsigned long int lufer;
    unsigned long long kofana;   // same as unsigned long long int kofana;
In unsigned representation there is no sign bit; the most significant bit is part of the magnitude. Thus we do not need 2's complement.
In this way, we can use the full range (2, 4 or 8 bytes) for zero and positive values. The ranges become (note that the positive range is almost doubled as compared to signed integers):
16-bit: 0 to USHRT_MAX (defined in the limits.h or climits header file): 0 to 65,535
32-bit: 0 to UINT_MAX (defined in the limits.h or climits header file): 0 to 4,294,967,295
64-bit: 0 to ULLONG_MAX (defined in the limits.h or climits header file): 0 to 18,446,744,073,709,551,615

unsigned integers
Unsigned variables do not store negatives, but nothing can stop us from assigning a negative value to an unsigned variable.
    unsigned short num;
    num = -25;
    cout << num;
-25 is negative, so it is represented using 2's complement. The resulting bit string is then interpreted as an unsigned number since it is assigned to an unsigned variable (implicit typecasting).
-25_10 = 1111111111100111_2 = 65511_10
So the output becomes 65511.
Of course, it is not normal programmer behavior to assign a negative value to an unsigned variable, but such things may occur unintentionally.
If you use a literal or constant on the right-hand side of the assignment, the compiler may warn you (depending on the warning level). However, if the right-hand side is an expression, the problem occurs at run time and the compiler cannot see it. Thus, you have to know what happens in such situations to locate the problem easily.

Limits
You can include limits.h, which defines the ranges of the integer types (depending on your platform/computer):
    #include <limits.h>
OR
    #include <climits>
Tip: type #include <limits.h> (or any other file name) in your program, then go to that line, right-click on the file name and choose "Open Document". That will open this header file. You can do this in general, and it will save you the effort of locating the file.
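As a small illustration of using these constants, the sketch below builds a text description of the short range from SHRT_MIN and SHRT_MAX and cross-checks two of the macros against std::numeric_limits. The function name shortRange is invented for the example, and the values in the note assume a 16-bit short as in Visual Studio.

```cpp
#include <climits>
#include <limits>
#include <string>

// The macros from <climits> and std::numeric_limits report the same limits.
static_assert(std::numeric_limits<int>::max() == INT_MAX,
              "INT_MAX matches numeric_limits<int>::max()");
static_assert(std::numeric_limits<unsigned int>::max() == UINT_MAX,
              "UINT_MAX matches numeric_limits<unsigned int>::max()");

// Describes the range of short on the current platform, e.g. "-32768 to 32767".
std::string shortRange() {
    return std::to_string(SHRT_MIN) + " to " + std::to_string(SHRT_MAX);
}
```

On a platform with a 16-bit short, shortRange() returns "-32768 to 32767". Printing INT_MIN, INT_MAX, LLONG_MAX and so on in the same way is a quick check of what a new compiler or platform actually provides.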

Typecasting between signed and unsigned numbers
Typecasting may be done explicitly, or sometimes it happens implicitly (e.g. when you assign an unsigned variable to a signed one, or vice versa). So, you should know how it works.
Signed to unsigned typecasting
Represent the signed number in 2's complement format. Interpret this bit string as unsigned.
If the MSB is 0, the signed and unsigned values are the same.
If the MSB is 1, the signed value is negative. For the unsigned conversion, the MSB is not considered as the sign bit; it is interpreted as part of the magnitude.
Two slides ago we had an example, but let us give another one.
    short ints = -30000;
    unsigned short intus = ints;    // implicit typecasting
    cout << intus;
The output is 35536, which has the same bit representation as -30000.

Typecasting between signed and unsigned numbers
Unsigned to signed typecasting
Represent the unsigned number as a bit string. Interpret this bit string as signed.
If the MSB is 0, the unsigned and signed values are the same.
If the MSB is 1, interpret the bit string as a negative value in 2's complement. This does not mean taking the 2's complement, but you can take the 2's complement to find the magnitude of this negative number.
Examples
    unsigned short usnum = 30000;
    cout << (short) usnum;
The output is 30000; the MSB of usnum is 0.
    unsigned short usnum = 63000;
    cout << (short) usnum;
The output is -2536; the MSB of usnum is 1.
Moral of the story behind typecasting: they are all the same bit strings; the only thing that changes is how to interpret them.
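The two directions of typecasting can be mimicked with plain arithmetic on 16-bit values, which makes the "same bits, different interpretation" rule explicit. This is a sketch: the function names are hypothetical, and int is assumed to have at least 32 bits.

```cpp
#include <cstdint>

// Signed to unsigned: keep the 16-bit pattern, read it as a magnitude.
// Conversion to an unsigned type is defined as reduction modulo 2^16.
unsigned int asUnsigned16(int value) {
    return static_cast<std::uint16_t>(value);
}

// Unsigned to signed: if the MSB (weight 32768) is set, the same pattern
// denotes a negative number, so subtract 2^16.
int asSigned16(unsigned int pattern) {
    pattern &= 0xFFFFu;                              // keep only 16 bits
    if (pattern >= 0x8000u) {
        return static_cast<int>(pattern) - 65536;
    }
    return static_cast<int>(pattern);
}
```

These reproduce the slide examples: asUnsigned16(-30000) is 35536, asUnsigned16(-25) is 65511, asSigned16(63000) is -2536, asSigned16(35000) is -30536, and asSigned16(30000) stays 30000.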

Some tips about selecting integer type
You may consider using an unsigned variable if you will store a non-negative number. However, if the value gets close to 0, this is a bit risky. Consider the following loop:
    unsigned int j;
    for (j = 5; j >= 0; j--)
        cout << j << endl;
This loop is infinite. When j is 0, it is decremented and you expect to get -1. The bit string really is that of -1 (a bit string with all 1s in it), but interpreted as an unsigned int this bit string is the largest unsigned integer. Thus it is >= 0, and the loop condition never becomes false.
Moral of the story: use unsigned only if you are sure that the value of the variable will never go below 0. Otherwise use signed integers.
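If an unsigned counter really has to count down to 0, one common way to avoid the infinite loop is to test the counter before decrementing it. The sketch below collects the visited values in a vector so that the behaviour can be inspected; the function name countDown is an assumption for this example.

```cpp
#include <vector>

// Visits start, start-1, ..., 1, 0 with an unsigned counter.
// The condition `j-- > 0` compares the old value with 0 and then decrements,
// so the body sees 0 as its last value and the loop ends without relying on
// a negative number. (For start == UINT_MAX, start + 1 wraps to 0 and the
// loop body never runs.)
std::vector<unsigned int> countDown(unsigned int start) {
    std::vector<unsigned int> visited;
    for (unsigned int j = start + 1; j-- > 0; ) {
        visited.push_back(j);
    }
    return visited;
}
```

countDown(5) produces the six values 5, 4, 3, 2, 1, 0. Using a signed int for the counter, as recommended above, is the simpler fix when the range allows it.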

Do not mix signed and unsigned numbers in an expression. Some strange things may happen. Consider the following code:
    unsigned int a = 5;
    int b = -10;
    if (a+b < 0)
        cout << "hede" << endl;
    else
        cout << "hodo" << endl;
You expect the output to be "hede", but it displays "hodo". Why?
In C++, there is a rule saying that the built-in operands of an operator must be of the same type. If they are different, one of them is implicitly typecast to the other before the expression is evaluated. Typical case: if you add an integer to a double, the integer is converted to double before the operation.
In the example above, the signed operand (b) is typecast to unsigned. -10 is a binary number with lots of 1s in 2's complement format, so as an unsigned value it is a big number. Adding 5 to it gives another unsigned number, which can never be less than 0.
RULE: If there is a signed and an unsigned number (of the same size) in an expression, the signed one is automatically converted to unsigned by the compiler before the evaluation.
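The conversion that the compiler applies silently can be written out explicitly, next to one possible workaround. Both function names below are made up for this sketch, and the workaround assumes that long long is wider than unsigned int (true on common platforms).

```cpp
// What the mixed expression a + b really computes: b is first converted
// to unsigned, so the sum is an unsigned value and can never be negative.
unsigned int mixedSum(unsigned int a, int b) {
    return a + static_cast<unsigned int>(b);
}

// Hypothetical workaround: convert both operands to a wider signed type
// before adding, so that the mathematical result is preserved.
long long widenedSum(unsigned int a, int b) {
    return static_cast<long long>(a) + static_cast<long long>(b);
}
```

For a = 5 and b = -10, mixedSum gives a huge unsigned value (the same bit string as -5), whereas widenedSum gives -5, so a test such as widenedSum(a, b) < 0 behaves as expected.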

Which integer size to use? Of course, this depends on the possible range of values you want to store in the variable. Using the largest type all the time may cause unnecessary memory usage, which is not good for efficiency. On the other hand, allow yourself some margin to proactively defend against unanticipated problems.
If you want to store a constant or literal bigger than the capacity, the compiler warns you (at compile time):
    short num = 45000;    // compiler warns
However, if you assign an expression that goes beyond the capacity, the compiler cannot see this and cannot warn you. This is a big problem, technically called "overflow", and we are going to see overflow today (if time permits, otherwise in the next lecture).

Wrap-up: Number Representations
Unsigned : for non-negative integers Two’s complement : for signed integers (zero, negative or positive) Unless otherwise noted (as unsigned), always assume that numbers we consider are in 2's complement representation. IEEE 754 floating-point : for real numbers (float, double) We did not add anything for float and double on top of what you know from CS201. At the end of these slides you can find how IEEE 754 floating-point representation works, but we will not talk about this and you are not responsible.

Arithmetic Overflow

Overflow
In this subsection, we will see the related topic of overflow, which basically means that after an operation such as addition or subtraction the result is not correct because it cannot be represented in the allocated space.
There is overflow in the following piece of code since a + b goes beyond the range covered by c:
    unsigned char a = 200;
    unsigned char b = 255;
    unsigned char c = a + b;
We will also give the rule for determining the value of c after the overflow. This may not look essential since the result is already wrong, but going into that depth may help us find logic errors during debugging.
We will start with small cases where the storage is 4 bits to understand the basics, and later we will generalize to the built-in types of C++.
We will give the basics of overflow for addition and subtraction with two operands. Other arithmetic operations and expressions with multiple operations may also cause overflow; we will generalize to this case at the end.
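The value stored in c in the example above can be worked out directly: the exact sum is 455, which needs 9 bits, and only the low 8 bits are kept. A minimal sketch, assuming 8-bit bytes (the function name addBytes is hypothetical):

```cpp
#include <climits>

static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");

// Adds two unsigned chars the way `unsigned char c = a + b;` does:
// the addition itself is done in int (no overflow there), and the
// conversion back to unsigned char keeps the result modulo 256.
unsigned char addBytes(unsigned char a, unsigned char b) {
    int exact = a + b;                        // 200 + 255 = 455
    return static_cast<unsigned char>(exact); // 455 - 256 = 199
}
```

So addBytes(200, 255) is 199: 455 is 111000111 in binary, and dropping the ninth bit leaves 11000111, which is 199.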

How Can We Detect Arithmetic Overflow?
Having a carry out of the MSB?
Arithmetic overflow cannot always be recognized by a carry out of the MSB. If there is a carry out of the MSB, we say that there is a "carry overflow", but this may or may not mean that there is "arithmetic overflow" and that the result is wrong.
E.g. 7 - 6 = 7 + (-6)
      0111 ( 7)
    + 1010 (-6)
    1 0001 ( 1)
There is a carry out of the MSB; it is discarded and the result is correct!
E.g. -7 - 6 = -7 + (-6)
      1001 (-7)
    + 1010 (-6)
    1 0011 ( 3)
There is a carry out of the MSB; it is discarded and the result is wrong!

How Can We Detect Arithmetic Overflow?
Having no carry out of the MSB?
This does not always mean that there is no arithmetic overflow. We may have arithmetic overflow even if there is no carry out of the MSB.
E.g. 7 + 1
      0111 ( 7)
    + 0001 ( 1)
      1000 (-8)
There is no carry out of the MSB, but the result is incorrect!
E.g. 2 - 3 = 2 + (-3)
      0010 ( 2)
    + 1101 (-3)
      1111 (-1)
There is no carry out of the MSB and the result is correct!

Overflow and 8 bit addition
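      01111000     120
    + 01111000    +120
      11110000     -16    Overflow!
(Four carries, 1 1 1 1, are generated into bit positions 4 to 7.)
The result fits in 8 bits, but it's still overflow!
Reminder: the 2's complement range with 8 bits is -128 to +127.
01111000 = 1x64 + 1x32 + 1x16 + 1x8 = 120_10
11110000 = -1x128 + 1x64 + 1x32 + 1x16 = -16_10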

Overflow – definition & detection
Overflow means that the right answer doesn't fit!
If you think in decimal and know the ranges, it is easy to detect: 120 + 120 = 240 and the range of a signed 8-bit integer is -128 ... 127; 240 is not in this range, so there is overflow.
More formally, for addition there is arithmetic overflow when the signs of the two numbers are the same AND the sign of the result is different from the sign of the numbers.

Detecting Overflow
There can't be an overflow when adding a positive and a negative number. Why? Basically because the magnitude of the number gets smaller without changing sign.
There can't be an overflow when the signs are the same for subtraction. Why? Same as above, since arithmetically this is adding a positive to a negative.
Overflow occurs when the value affects the sign, i.e. there is overflow when:
1. adding two positives yields a negative, or
2. adding two negatives gives a positive, or
3. subtracting a negative from a positive gives a negative (similar to 1), or
4. subtracting a positive from a negative gives a positive (similar to 2).
Of course, this rule is for signed integers; for unsigned integers, we will see it later.
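The sign rule for addition can be checked against the "does the exact result fit in the range" definition. The sketch below does both for 8-bit signed values; the function names are invented for this illustration.

```cpp
#include <cstdint>

// Definition: overflow if the exact sum is outside [-128, 127].
// The sum is computed in int, where it cannot overflow.
bool addOverflowsByRange(std::int8_t a, std::int8_t b) {
    int exact = static_cast<int>(a) + static_cast<int>(b);
    return exact < INT8_MIN || exact > INT8_MAX;
}

// Sign rule: overflow if both operands have the same sign bit and the
// 8-bit result has a different sign bit. Works on the raw bit patterns.
bool addOverflowsBySign(std::int8_t a, std::int8_t b) {
    std::uint8_t ua = static_cast<std::uint8_t>(a);
    std::uint8_t ub = static_cast<std::uint8_t>(b);
    std::uint8_t sum = static_cast<std::uint8_t>(ua + ub);   // carry discarded
    // (ua ^ sum) has bit 7 set if a's sign differs from the result's sign;
    // the same holds for b. Both must differ for overflow.
    return ((ua ^ sum) & (ub ^ sum) & 0x80u) != 0;
}
```

Both functions report overflow for 120 + 120 and for -100 + -100, and no overflow for 7 + (-6) or 2 + (-3), in line with the rules above.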

Visualizing Overflow
(Table on the slide: the 4-bit 2's complement numbers with their bit strings, e.g. 0 = 0000, +1 = 0001, ..., with a red line between +7 (0111) and -8 (1000).)
Let us visualize the reason for overflow in the 4-bit case for signed integers (2's complement).
Start with the first operand and
(circularly) go up by the second operand for subtraction,
(circularly) go down by the second operand for addition.
Overflow occurs if our arithmetic operation crosses this red line (in either direction).
Wrapping around (0 to -1 or -1 to 0) does not mean overflow.

Visualizing Overflow for char and short
(Figure on the slide: the same visualization for char and for short, each listing numbers with their 2's complement bit strings.)

Detecting Overflow – Complex Expressions
The rule of detecting the change of sign in the result applies to all signed integer types of C++, but only for simple addition and subtraction.
What about more complex operations? There is no simple formula for that; apply these steps:
Simply calculate using decimal arithmetic and see if the result fits in the range. If it does not fit, there is overflow.
Convert the overflowed result to binary and truncate it so that it fits in n bits (where n is the number of bits in the corresponding type).
Interpret the truncated bit string in 2's complement logic.
Examples
    char d = 3*200+21;
621 is not between -128 and 127, so there is overflow. 621_10 = 1001101101_2. Discard the most significant two bits since they do not fit in 8 bits (the storage for char). The resulting bit string is 01101101, which is 109 (decimal).
    char d = 2*200+15;
415 is not between -128 and 127, so there is overflow. 415_10 = 110011111_2. Discard the most significant bit since it does not fit in 8 bits (the storage for char). The resulting bit string is 10011111, which is -97 (2's complement).
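The three steps (compute exactly, truncate to n bits, reinterpret in 2's complement) can be sketched as one function. The name truncateToSigned is hypothetical, and the sketch assumes 1 <= bits <= 31 so that the result fits in an int.

```cpp
// Returns the value that remains when `exact` is truncated to its low
// `bits` bits and the result is read as a 2's complement number.
int truncateToSigned(long long exact, int bits) {
    const long long modulus = 1LL << bits;   // 2^bits
    long long low = exact % modulus;         // keep the low `bits` bits
    if (low < 0) low += modulus;             // make the remainder non-negative
    if (low >= modulus / 2) low -= modulus;  // MSB set: negative value
    return static_cast<int>(low);
}
```

truncateToSigned(621, 8) is 109 and truncateToSigned(415, 8) is -97, as in the two char examples; the same function gives -30536 for truncateToSigned(35000, 16).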

Detecting Overflow - unsigned integers
(Table on the slide: decimal numbers and their 4-bit binary representations.)
Let us visualize the overflow case for 4-bit unsigned integers.
We have two red lines here: there is overflow if you go below 0 or beyond 15.
If you add 1 to 15 you end up with 10000 in binary, and when you discard the overflow bit the resulting value becomes 0. Similarly, subtracting 1 from 0 yields 15.
Generalization of this case to n-bit unsigned integers is trivial: the max value is 2^n - 1 and the number of bits in binary is n.
Evaluation of a complex expression is similar to the signed case: do the operation, convert to binary and discard the overflowed bits; but this time interpret the result as an unsigned number.

Detecting Overflow in Programs
So far we have not talked much about automatic ways of detecting overflows in programs, only about detecting the change in the sign bit for addition and subtraction of signed integers.
Unfortunately, there is no other silver bullet for detecting overflows once they have occurred.
It is better if overflows are avoided. To do so, you may use simple expressions and check the values of the operands to see if they are small enough not to cause overflow.
For example, suppose a and b are two unsigned ints and a is not 0; a*b does not overflow if b <= UINT_MAX / a.
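A check of this kind can be wrapped in a small helper that is called before multiplying. This is only a sketch; the function name mulFits is an assumption, and it handles a == 0 separately to avoid dividing by zero.

```cpp
#include <climits>

// Returns true if a * b can be computed in unsigned int without overflow.
// For a > 0, a * b <= UINT_MAX holds exactly when b <= UINT_MAX / a
// (integer division).
bool mulFits(unsigned int a, unsigned int b) {
    if (a == 0) {
        return true;              // 0 * b is always 0
    }
    return b <= UINT_MAX / a;
}
```

A typical use is `if (mulFits(a, b)) product = a * b; else /* report the problem */;`, so the multiplication is only performed when the result is known to fit.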

Floating Point Representation
SKIP – Not Covered in CS204 You may read if you are curious

Floating Point (a brief look)
We need a way to represent:
numbers with fractions,
very small numbers,
very large numbers, e.g. 3.1 x 10^20
Solution: a floating (decimal) point representation.
IEEE 754 floating point representation is the standard: a number is stored as a sign (+/-), a mantissa and an exponent.
Single precision: 1 bit sign, 23 bit significand (mantissa), 8 bit exponent
More bits for the significand give more accuracy; more bits for the exponent increase the range.
Range: approximately 10^-44 to 10^38

IEEE Floating Point Std. - Details
The Mantissa
The mantissa, also known as the significand, represents the precision bits of the number.
To find out the value of the implicit leading bit, consider that any number can be expressed in scientific notation in many different ways. For example, the number five can be represented as any of these:
    5.00 x 10^0
    0.05 x 10^2
    5000 x 10^-3
In order to maximize the quantity of representable numbers, floating-point numbers are typically stored in normalized form. This basically puts the radix point after the first non-zero digit. In normalized form, five is represented as 5.0 x 10^0.

Floating Point – what floats?
For simplicity, let's use a decimal representation and assume we have 1 digit for the sign, 8 digits for the mantissa and 3 digits for the exponent.
We will "illustrate" the format with the number -10, written as a signed mantissa times a power of ten; it is stored as its sign, its mantissa and its exponent.
The actual IEEE floating point representation follows this principle, but differs from it in the details:
- normalization (the floating point comes after the first nonzero digit)
- binary instead of decimal
- exponent (not sign/magnitude but biased)

IEEE Floating Point Std. - Normalization
Since the only possible non-zero digit is 1, in the IEEE floating point standard, we can just assume a leading digit of 1, and don't need to represent it explicitly. As a result, the mantissa has effectively 24 bits of resolution, by way of 23 fraction bits.

IEEE Floating Point Std. - Binary
We convert decimal to binary simply as:
decimal: -0.75 = -(1/2 + 1/4)
binary: -0.11 (since we have bits for 1/2, 1/4, 1/8, ... etc.)
canonical form: -1.1 x 2^-1 (note: shifting the radix point by k is the same as multiplying/dividing by radix^k)
Stored sign: -
Stored mantissa: 100...0 (the leading 1 is not stored since the leading bit is always 1)
Stored exponent: -1 (basically; a bias is actually used, but I won't go into the details here)
A second, positive example is handled the same way: its canonical form has exponent 3 (1.f x 2^3), the stored sign is +, and the stored mantissa is f followed by 0s, since the leading bit is always 1.

Bias – why?
Since we want to represent both positive and negative exponents (e.g. 10^-11), we can do one of two things:
Reserve a separate sign bit for the exponent
Use only positive stored exponents, together with a bias
The bias (e.g. 127) is subtracted from whatever is stored in the exponent field to find the real exponent:
Stored exponent = 0: real exponent = 0 - 127 = -127
Stored exponent = 227: real exponent = 227 - 127 = 100

Bias of the Exponent The Exponent
The exponent field needs to represent both positive and negative exponents. To do this, a bias is added to the actual exponent in order to get the stored exponent.
For IEEE single-precision floats, this value is 127. Thus, if the real exponent is zero, 127 is stored in the exponent field.
If 200 is stored in the exponent field, it actually indicates a real exponent of (200 - 127), or 73.
Exponents of -127 (all 0s) and +128 (all 1s) are reserved for special numbers (NaN, Infinity).

IEEE 754 floating-point standard: summary
Leading "1" bit of the significand is implicit.
The exponent is "biased" to make sorting easier: all 0s is the smallest exponent, all 1s is the largest.
Bias of 127 for single precision (note the addition of the bias while storing and the subtraction of the bias while converting to decimal).
Decimal equivalent: (-1)^sign x (1 + significand) x 2^(exponent - bias)
Example:
decimal: -0.75 = -(1/2 + 1/4)
binary: -.11
canonical form: -1.1 x 2^-1 (note: shifting the radix point by k is the same as multiplying/dividing by radix^k)
stored exponent = -1 + 127 = 126 = 01111110
Resulting IEEE single precision representation: sign = 1, exponent = 01111110, mantissa = 10000000000000000000000

A more complex example
Let us encode the decimal number -118.625 using the IEEE 754 system.
First we need to get the sign, the exponent and the fraction. Because it is a negative number, the sign is "1".
Now, we write the number (without the sign; i.e. unsigned, no two's complement) using binary notation. The result is 1110110.101 (notice how we represent .625).
Next, let's move the radix point left, leaving only a 1 at its left: 1110110.101 = 1.110110101 x 2^6. This is the normalized floating point number.
The mantissa is the part at the right of the radix point, filled with 0s on the right until we get all 23 bits. That is 11011010100000000000000.
The exponent is 6, but we need to bias it and convert it to binary (so the most negative exponent is stored as 0, and all exponents are non-negative binary numbers). For the 32-bit IEEE 754 format, the bias is 127 and so the stored exponent is 6 + 127 = 133. In binary, this is written as 10000101.
Putting them all together: 1 10000101 11011010100000000000000
This example is from Wikipedia.
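The three fields of a single-precision number can be inspected by copying the bits of a float into a 32-bit integer and masking. The sketch below assumes that float is the 32-bit IEEE 754 format (true on common platforms); the struct and function names are invented for the example.

```cpp
#include <cstdint>
#include <cstring>

// The three stored fields of an IEEE 754 single-precision number.
struct FloatFields {
    std::uint32_t sign;            // 1 bit
    std::uint32_t storedExponent;  // 8 bits, real exponent + 127
    std::uint32_t fraction;        // 23 bits, the part after the implicit "1."
};

FloatFields decompose(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);   // copy the raw bit pattern
    FloatFields fields;
    fields.sign = bits >> 31;
    fields.storedExponent = (bits >> 23) & 0xFFu;
    fields.fraction = bits & 0x7FFFFFu;
    return fields;
}
```

For -118.625f this gives sign 1, stored exponent 133 and fraction 0x6D4000, which is the bit string 11011010100000000000000. For -0.75f it gives sign 1, stored exponent 126 and fraction 0x400000 (a 1 followed by 22 zeros).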

IEEE Floating Point: Ranges
Explanation for the minimum positive value (just a sign change for the negative one): the smallest mantissa step (2^-23, with 23 mantissa bits) times 2^-126 gives 2^-149.
Note 1: The exponent "00000000" is reserved for special numbers, so the minimum is "00000001".
Note 2: Approximate conversion between powers of 2 and powers of 10: e.g. 2^-149 is about 10^-45, since 2^3.3 is about 10 and 149/3.3 is about 45.

IEEE Floating Point Ranges
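Explanation for the maximum positive value (just change the sign for the negative one):
+ (1 + (1 - 2^-23)) x 2^(254 - 127) = (2 - 2^-23) x 2^127, with a 23-bit mantissa of all 1s and the 8-bit stored exponent 11111110 = 254
Note 1: Since it represents the part after the radix point, "111...1" = 1 - 2^-23, just as ".11" = 1 - 2^-2.
Note 2: 11111111 as exponent is reserved for special numbers, so the maximum is 11111110.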

Summary Computer arithmetic is constrained by limited precision
Bit patterns have no inherent meaning but standards do exist two’s complement IEEE 754 floating point Computer instructions determine “meaning” of the bit patterns

Floating Point Complexities
In addition to overflow we can have "underflow": a number that is smaller than what is representable (e.g. < 2^-126).
Accuracy can be a big problem:
IEEE 754 keeps two extra bits, guard and round
four rounding modes
a positive number divided by zero yields "infinity"
zero divided by zero yields "not a number"
other complexities...
