Problems with Floating-Point Representations Douglas Wilhelm Harder Department of Electrical and Computer Engineering University of Waterloo Copyright © 2007 by Douglas Wilhelm Harder. All rights reserved. ECE 204 Numerical Methods for Computer Engineers

Problems with Floating-Point Representations This topic will cover a number of the problems with using a floating-point representation, including: –underflow and overflow –subtractive cancellation –adding large and small numbers –non-associativity: (a + b) + c ≠ a + (b + c)

Problems with Floating-Point Representations Underflow and Overflow In our six-decimal-digit floating-point representation, the largest number we can represent is limited by the exponent range. The largest double is approximately 1.8 × 10^308:
>> format long; realmax
realmax = 1.797693134862316e+308
>> format hex; realmax
realmax = 7fefffffffffffff
or more correctly, (2 − 2^−52) × 2^1023
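These limits are not specific to Matlab; any IEEE 754 environment reports the same values. A minimal Python sketch (Python standing in for the Matlab session above) confirming realmax's value and bit pattern:

```python
import struct
import sys

# The largest finite double is (2 - 2^-52) * 2^1023.
largest = sys.float_info.max
print(largest)                              # 1.7976931348623157e+308
print(struct.pack('>d', largest).hex())     # 7fefffffffffffff, as in Matlab's format hex
print(largest == (2 - 2**-52) * 2.0**1023)  # True
```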

Problems with Floating-Point Representations Underflow and Overflow Any number larger than these values cannot be represented using these formats. To solve this problem, we can introduce a floating-point infinity:
>> format long; 2e308
ans = Inf
>> format hex; 2e308
ans = 7ff0000000000000

Problems with Floating-Point Representations Underflow and Overflow The properties of infinity include: –any real number plus infinity is infinity –one over infinity is 0 –any positive number times infinity is infinity –any negative number times infinity is –infinity For example:
>> Inf + 1e100
ans = Inf
>> 325*Inf
ans = Inf
>> 1/Inf
ans = 0
>> -2*Inf
ans = -Inf
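The same arithmetic rules for infinity hold in Python, which follows IEEE 754 just as Matlab does; a small sketch:

```python
inf = float('inf')

print(inf + 1e100)   # inf: any real plus infinity is infinity
print(325 * inf)     # inf: a positive number times infinity is infinity
print(1 / inf)       # 0.0: one over infinity is zero
print(-2 * inf)      # -inf: a negative number times infinity is -infinity
print(2e308)         # inf: this literal already overflows the double range
```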

Problems with Floating-Point Representations Underflow and Overflow The introduction of a floating-point infinity allows computations to continue and removes the necessity of signaling overflows through exceptions. An example where infinity may not cause a problem is one where its reciprocal is immediately taken:
>> 5 + 1/2e400
ans = 5

Problems with Floating-Point Representations Underflow and Overflow In our six-decimal-digit floating-point representation, the smallest positive number we can represent is on the order of 10^−49. The smallest positive double (using the normal representation) is approximately 2.2 × 10^−308:
>> format long; realmin
realmin = 2.225073858507201e-308
>> format hex; realmin
realmin = 0010000000000000
or more correctly, 2^−1022
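As with realmax, the smallest normal double can be verified outside Matlab; a Python sketch:

```python
import struct
import sys

# The smallest positive normalized double is 2^-1022.
smallest_normal = sys.float_info.min
print(smallest_normal)                           # 2.2250738585072014e-308
print(struct.pack('>d', smallest_normal).hex())  # 0010000000000000
print(smallest_normal == 2.0**-1022)             # True
```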

Problems with Floating-Point Representations Underflow and Overflow Storing real numbers on a computer: –we must use a fixed amount of memory, –we should be able to represent a wide range of numbers, both large and small, –we should be able to represent numbers with a small relative error, and –we should be able to easily test if one number is greater than, equal to, or less than another

Problems with Floating-Point Representations Underflow and Overflow Any number smaller than these values is represented by 0. This is represented by a double with all 0 bits, with the possible exception of the sign bit:
>> format hex; 0
ans = 0000000000000000
>> -0
ans = 8000000000000000
>> format long; 1/0
ans = Inf
>> 1/-0
ans = -Inf
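The two zeros can be distinguished by their bit patterns. A Python sketch (note one difference from Matlab: Python raises ZeroDivisionError for 1/0.0 instead of returning Inf, but the sign of zero is still tracked):

```python
import math
import struct

pos = 0.0
neg = -0.0
print(struct.pack('>d', pos).hex())   # 0000000000000000: all bits zero
print(struct.pack('>d', neg).hex())   # 8000000000000000: only the sign bit set
print(pos == neg)                     # True: +0 and -0 compare as equal
# The sign still matters to sign-aware functions:
print(math.copysign(1.0, neg))        # -1.0
```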

Problems with Floating-Point Representations Underflow and Overflow You may have noticed that we did not use both the largest and smallest exponents:
>> format hex; realmax
realmax = 7fefffffffffffff
>> realmin
realmin = 0010000000000000
The largest and smallest exponents should have been 7ff and 000, respectively

Problems with Floating-Point Representations Underflow and Overflow These “special” exponents are used to represent special numbers, such as: –infinity: 7ff000··· and fff000··· –not-a-number: 7ff800··· –zero: 000000··· and 800000··· –denormalized numbers: numbers existing between 0 and realmin, but at reduced precision
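These reserved bit patterns can be inspected directly; a Python sketch showing the special exponents and a denormalized (subnormal) value:

```python
import struct
import sys

def bits(x):
    """Return the 16 hex digits of the double's bit pattern."""
    return struct.pack('>d', x).hex()

print(bits(float('inf')))    # 7ff0000000000000
print(bits(float('-inf')))   # fff0000000000000
print(bits(float('nan')))    # 7ff8000000000000
# Denormalized numbers fill in between 0 and realmin at reduced precision;
# the smallest positive subnormal is 2^-1074:
tiny = sys.float_info.min / 2**52
print(tiny)                  # 5e-324
print(bits(tiny))            # 0000000000000001
```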

Problems with Floating-Point Representations Underflow and Overflow Thus, we can classify numbers which: –are represented by 0, –are not represented with full precision, –are represented using 53 bits of precision, and –are represented by infinity

Problems with Floating-Point Representations Subtractive Cancellation The next problem we will look at deals with subtracting similar numbers. Suppose we take the difference between π and the 3-digit approximation 3.14 using our six-digit floating-point representation. Performing the calculation: 3.14159 − 3.14000 = 0.00159, which has the representation 1.59000 × 10^−3

Problems with Floating-Point Representations Subtractive Cancellation How accurate is this difference? Recall that 3.14 is represented precisely by our floating-point representation, but our representation of π has a relative error of approximately 8.4 × 10^−7. By calculating the difference of almost-equal numbers, we lose a significant amount of precision

Problems with Floating-Point Representations Subtractive Cancellation The actual value of the difference is π − 3.14 = 0.0015926535···, so our computed difference 1.59000 × 10^−3 is already wrong in its fourth significant digit. Thus, the relative error in the result we were trying to calculate is significant: 25.58%
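The six-digit arithmetic above can be mimicked with Python's decimal module; a sketch reproducing the loss of precision in π − 3.14:

```python
from decimal import Decimal, getcontext
import math

getcontext().prec = 6                # mimic six decimal digits of precision

pi6 = Decimal('3.14159')             # the six-digit representation of pi
diff = pi6 - Decimal('3.14')
print(diff)                          # 0.00159: only three significant digits survive

exact = math.pi - 3.14               # 0.0015926535...
rel_err = abs(float(diff) - exact) / exact
print(rel_err)                       # roughly 1.7e-3, versus 8.4e-7 for pi6 itself
```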

Problems with Floating-Point Representations Subtractive Cancellation Subtractive cancellation is the phenomenon where the subtraction of similar numbers results in a significant reduction in precision

Problems with Floating-Point Representations Subtractive Cancellation As another example, recall the definition of the derivative: f'(x) = lim h→0 (f(x + h) − f(x))/h Assuming that this limit converges, using a smaller and smaller value of h should result in a very good approximation to f'(x)

Problems with Floating-Point Representations Subtractive Cancellation Let’s try this out with f(x) = sin(x) and let us approximate f'(1). From calculus, we know that the actual derivative is cos(1) = 0.5403023058681398··· Let us use Matlab to approximate this derivative using h = 0.1, 0.01, 0.001,...

Problems with Floating-Point Representations Subtractive Cancellation
>> for i = 1:8
h = 10^-i;
(sin(1 + h) - sin(1))/h
end
As h decreases from 10^−1 down to 10^−8, the successive approximations approach cos(1)

Problems with Floating-Point Representations Subtractive Cancellation
>> for i = 8:16
h = 10^-i;
(sin(1 + h) - sin(1))/h
end
As h decreases further, the approximations move away from cos(1), and the final approximation, with h = 10^−16, is exactly 0

Problems with Floating-Point Representations Subtractive Cancellation What happened here? With h = 10^−8, we had an approximation with a relative error of 2.6 × 10^−8, or about 7 decimal digits of precision. With smaller and smaller values of h, however, the error increases until we have a completely useless approximation when h = 10^−16
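This behavior is easy to reproduce; a Python sketch of the same experiment:

```python
import math

exact = math.cos(1)    # the true derivative of sin(x) at x = 1

def fwd_diff(h):
    """Forward-difference approximation of the derivative of sin at 1."""
    return (math.sin(1 + h) - math.sin(1)) / h

for i in (1, 4, 8, 12, 16):
    h = 10.0**-i
    print(f"h = 1e-{i:<2}  error = {abs(fwd_diff(h) - exact):.2e}")
# The error shrinks until about h = 1e-8, then grows again; at h = 1e-16,
# 1 + h rounds to exactly 1 and the approximation collapses to 0.
```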

Problems with Floating-Point Representations Subtractive Cancellation Looking at sin(1 + h) and sin(1) when h = 10^−12:
>> h = 1e-12
h = 1.000000000000000e-12
>> sin(1 + h)
ans = 0.841470984808437
>> sin(1)
ans = 0.841470984807897
Consequently, we are subtracting two numbers which are almost equal

Problems with Floating-Point Representations Subtractive Cancellation The next slide shows the bits of the results using h = 2^−n for n = 1, 2,..., 53. Note that double-precision floating-point numbers have 53 bits of precision. The red digits show the precision lost to subtractive cancellation

>> for i = 1:53
h = 2^-i;
(sin(1 + h) - sin(1))/h
end
Approximating the derivative of sin(x) at x = 1: green digits show accuracy, while red digits show loss of precision. (The table of the 53 resulting bit patterns is not reproduced in this transcript.)

Problems with Floating-Point Representations Subtractive Cancellation Later in this course, we will find a formula which approximates the derivative of sin(x) at x = 1 significantly more closely to cos(1) = 0.540302305868140 than any approximation we saw before
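The transcript does not preserve that formula, but one standard improvement of this kind is the centred difference (f(x + h) − f(x − h))/(2h); the choice of centred differences here is my assumption, not taken from the slides. A Python sketch comparing it against the forward difference:

```python
import math

exact = math.cos(1)
h = 1e-5

# Forward difference: O(h) truncation error.
forward = (math.sin(1 + h) - math.sin(1)) / h
# Centred difference (assumed improved formula): O(h^2) truncation error.
centered = (math.sin(1 + h) - math.sin(1 - h)) / (2*h)

print(abs(forward - exact))    # around 4e-6
print(abs(centered - exact))   # several orders of magnitude smaller
```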

Problems with Floating-Point Representations Subtractive Cancellation Thus, we cannot simply use the formulae covered in calculus to calculate values numerically. We will now see how an algebraic formula you learned in high school can also fail: –the quadratic formula

Problems with Floating-Point Representations Subtractive Cancellation Rather than using doubles, we will use our six-digit floating-point numbers to show how the quadratic formula can fail. Suppose we wish to find the smaller root of a quadratic equation whose two negative roots differ greatly in magnitude, the larger being near x = −144.2

Problems with Floating-Point Representations Subtractive Cancellation Using four decimal digits of precision for each calculation, we find that our approximation to the smaller of the two roots has a relative error of 0.34, or 34%

Problems with Floating-Point Representations Subtractive Cancellation Approximating the larger of the two roots, we get x = −144.2. The relative error of this approximation is negligible by comparison. Why does one formula work so well while the other fails so miserably?

Problems with Floating-Point Representations Subtractive Cancellation Stepping through the calculation of b, b^2, 4ac, and b^2 − 4ac with four digits of precision shows where the error enters: by the time we subtract, rounding has destroyed the digits on which the subtraction depends, so the computed value differs significantly from the actual value
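The failure and its standard cure can be demonstrated in Python. The coefficients below are illustrative (the slide's own coefficients are not preserved in this transcript); the damaging step is the subtraction of nearly equal quantities in the quadratic formula, and the fix rewrites the small root using the product of the roots, x₁x₂ = c/a:

```python
import math

# Illustrative coefficients (not the slide's): x^2 + 1e8*x + 1 = 0,
# with roots near -1e8 and -1e-8.
a, b, c = 1.0, 1e8, 1.0
disc = math.sqrt(b*b - 4*a*c)

naive_small = (-b + disc) / (2*a)    # -b + disc cancels almost completely
big_root    = (-b - disc) / (2*a)    # same signs: no cancellation
stable_small = c / (a * big_root)    # rewrite using x1 * x2 = c/a

print(naive_small)    # visibly wrong in the leading digit
print(stable_small)   # essentially exact: -1e-08
```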

Problems with Floating-Point Representations Non-Associativity Normally, the operations of addition and multiplication are associative, that is: (a + b) + c = a + (b + c) and (ab)c = a(bc). Unfortunately, floating-point arithmetic is not associative. If we add a large number to a small number, the large number dominates: the small addend is lost and the sum remains 5593.

Problems with Floating-Point Representations Non-Associativity Consider adding a small number to 54.73 and then subtracting 54.39. If we calculate the first sum first, the small number is absorbed before the subtraction and we get 0.35. If we calculate the second sum first, 54.73 − 54.39 = 0.34, and the small number still contributes to the final result
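Non-associativity is just as visible in double precision; a Python sketch:

```python
# Grouping changes the result:
x = (0.1 + 0.2) + 0.3
y = 0.1 + (0.2 + 0.3)
print(x == y)    # False
print(x, y)      # 0.6000000000000001 0.6

# A large number absorbs a small one entirely (2^53 is the last point
# before the gap between consecutive doubles exceeds 1):
print(2.0**53 + 1.0 == 2.0**53)    # True
```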

Problems with Floating-Point Representations Order of Operations Consider calculating a sum of one million terms, n = 1, 2,..., 10^6, in Matlab. The correct answer is known to 20 decimal digits of precision

Problems with Floating-Point Representations Order of Operations Adding the numbers in the natural order, from 1 to 10^6, we get one result; adding the numbers in the reverse order, we get another. The second result is off from the correct answer by only the last digit (and only by 0.76)

Problems with Floating-Point Representations Order of Operations To see why this happens, consider a decimal floating-point model which stores only four decimal digits of precision. Adding from left to right, each small term is absorbed by the large running total, and we get 52.37

Problems with Floating-Point Representations Order of Operations Adding the expression from right to left, the small terms first accumulate into a value large enough to affect the total. This second value has a smaller relative error when compared to the correct answer computed with full precision
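The effect of summation order can be reproduced in Python with the harmonic sum 1 + 1/2 + ··· + 1/10^6 (an illustrative choice; the transcript does not preserve the slide's summand), using math.fsum as the full-precision reference:

```python
import math

N = 10**6
forward   = sum(1.0/n for n in range(1, N + 1))        # large terms first
backward  = sum(1.0/n for n in range(N, 0, -1))        # small terms first
reference = math.fsum(1.0/n for n in range(1, N + 1))  # correctly rounded sum

print(abs(forward - reference))    # small but nonzero
print(abs(backward - reference))   # no larger: small terms accumulate before
                                   # they meet the large running total
```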

Usage Notes These slides are made publicly available on the web for anyone to use If you choose to use them, or a part thereof, for a course at another institution, I ask only three things: –that you inform me that you are using the slides, –that you acknowledge my work, and –that you alert me of any mistakes which I made or changes which you make, and allow me the option of incorporating such changes (with an acknowledgment) in my set of slides Sincerely, Douglas Wilhelm Harder, MMath