Floating Point Computation

1 Floating Point Computation
Jyun-Ming Chen

2 Contents
Sources of Computational Error
Computer Representation of (Floating-Point) Numbers
Efficiency Issues

3 Sources of Computational Error
Converting a mathematical problem into a numerical one introduces errors due to limited computational resources:
round-off error (limited precision of representation)
truncation error (limited time for computation)
Misc.:
error in the original data
blunder (programming/data-input error)
propagated error

4 Supplement: Error Classification (Hildebrand)
Gross error: caused by human or mechanical mistakes.
Roundoff error: the consequence of using a number specified by n correct digits to approximate a number which requires more than n digits (generally infinitely many) for its exact specification.
Truncation error: any error which is neither a gross error nor a roundoff error. Frequently, a truncation error corresponds to the fact that, whereas an exact result would be afforded (in the limit) by an infinite sequence of steps, the process is truncated after a certain finite number of steps.

5 Common Measures of Error
total error = roundoff error + truncation error
absolute error = | numerical – exact |
relative error = absolute error / | exact |
If the exact value is zero, the relative error is not defined.

6 Ex: Roundoff Error
A representation consists of a finite number of digits.
Implication: the representable real numbers are discrete (more later).

7 Watch out for printf!! By default, "%f" prints six digits after the decimal point.
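
The slide's demonstration code is not in the transcript; a minimal sketch of the pitfall:

    #include <stdio.h>

    int main(void)
    {
        float x = 1.0f / 3.0f;
        printf("%f\n", x);    /* 0.333333    -- "%f" shows only 6 digits  */
        printf("%.9f\n", x);  /* 0.333333343 -- more digits expose the
                                 roundoff in the stored value             */
        return 0;
    }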

8 Ex: Numerical Differentiation
Evaluating the first derivative of f(x); the finite step size h introduces a truncation error.
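
The slide's formulas are lost from the transcript; the standard forward-difference approximation it presumably used is

    f'(x) ≈ ( f(x+h) – f(x) ) / h

and the truncation error follows from the Taylor expansion f(x+h) = f(x) + h f'(x) + (h^2/2) f''(ξ): the approximation is off by (h/2) f''(ξ) = O(h).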

9 Numerical Differentiation (cont)
Select a problem with a known answer, so that we can evaluate the error!

10 Numerical Differentiation (cont)
Error analysis: as h decreases, the truncation error decreases. What happened at h = ?!
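
The plot on the slide is not in the transcript; a minimal C sketch of the same experiment, assuming f(x) = sin(x) so the exact derivative cos(1) is known:

    #include <math.h>
    #include <stdio.h>

    /* Forward difference: truncation error O(h), roundoff error ~ME/h. */
    double fwd_diff(double (*f)(double), double x, double h)
    {
        return (f(x + h) - f(x)) / h;
    }

    int main(void)
    {
        for (double h = 1e-1; h >= 1e-15; h /= 10.0)
            printf("h = %.0e  error = %.3e\n",
                   h, fabs(fwd_diff(sin, 1.0, h) - cos(1.0)));
        return 0;
    }

As h shrinks, the truncation error O(h) falls, but below some h the roundoff in computing f(x+h) – f(x) takes over and the error grows again.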

11 Ex: Polynomial Deflation
F(x) is a polynomial with 20 real roots.
Use any method to numerically solve for a root, then deflate the polynomial to 19th degree.
Solve for another root, deflate again, and again, …
The accuracy of the roots obtained gets worse each time, due to error propagation.

12 Computer Representation of Floating Point Numbers
Floating point vs. fixed point
Decimal-binary conversion
Standard: IEEE 754 (1985)

13 Floating VS. Fixed Point
Decimal, 6 digits (positive numbers):
Fixed point, with 5 digits after the decimal point: 0.00001, …, 9.99999
Floating point, 2 digits as exponent (base 10) and 4 digits for mantissa (accuracy): 0.001×10^-99, …, 9.999×10^99
Comparison:
Fixed point: fixed accuracy; simple math for computation (sometimes used in graphics programs)
Floating point: trades accuracy for a larger range of representation

14 Decimal-Binary Conversion
Ex: 134 (base 10)
Ex: (base 10)
Ex: 0.1 (base 10)
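
The worked conversions are lost from the transcript; the two recoverable ones go like this:

    134 = 128 + 4 + 2 = 2^7 + 2^2 + 2^1 = (10000110)_2
    0.1 (base 10) = (0.000110011001100…)_2 — the block 0011 repeats forever,
    so 0.1 has no finite binary representation and must be rounded off.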

15 Floating Point Representation
A floating-point number is written as f × b^e:
Fraction f: usually normalized so that the leading digit is nonzero (e.g., 1/b ≤ f < 1)
Base b: 2 for personal computers, 16 for mainframes
Exponent e

16 Understanding Your Platform
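
The slide's program is not in the transcript; a minimal sketch of the kind of platform check it likely showed:

    #include <stdio.h>

    int main(void)
    {
        printf("sizeof(int)    = %zu\n", sizeof(int));
        printf("sizeof(float)  = %zu\n", sizeof(float));
        printf("sizeof(double) = %zu\n", sizeof(double));
        return 0;
    }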

17 Padding How about
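
The slide's example is also lost; a typical illustration of padding, using a hypothetical struct:

    #include <stdio.h>

    /* The compiler inserts padding after c so that d stays 8-byte aligned. */
    struct Mixed {
        char   c;   /* 1 byte          */
        double d;   /* 8 bytes, aligned */
    };

    int main(void)
    {
        printf("sizeof(struct Mixed) = %zu\n", sizeof(struct Mixed));
        /* typically prints 16, not 9 */
        return 0;
    }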

18 IEEE 754-1985 Purpose: make floating-point systems portable
Defines the number representation, how calculations are performed, exceptions, …
Single precision (32-bit)
Double precision (64-bit)

19 Number Representation
S: sign of mantissa
Range (roughly): single: 10^-38 to 10^38; double: 10^-307 to 10^307
Precision (roughly): single: 7 significant decimal digits; double: 15 significant decimal digits
Describe how these are obtained
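
A sketch of where these figures come from, using the IEEE 754 field widths: single precision stores 23 fraction bits plus the implicit one, i.e. 24 significant bits, and 24 × log10(2) ≈ 7.2 decimal digits; its largest exponent gives 2^128 ≈ 3.4 × 10^38. Double precision stores 52 + 1 = 53 bits, giving 53 × log10(2) ≈ 16 decimal digits, and 2^1023 ≈ 9 × 10^307.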

20 Implication When you write your program, make sure the results you print carry only the meaningful significant digits.

21 Implicit One The mantissa is normalized so that its leading bit is always 1; that bit need not be stored, which gains one extra bit of precision. Ex: –3.5

22 Exponent Bias Ex: in single precision, the exponent has 8 bits:
00000000 (0) to 11111111 (255)
Add an offset to represent +/– exponents: effective exponent = biased exponent – bias
Bias value: 32-bit (127); 64-bit (1023)
Ex: 32-bit, biased exponent 128: effective exponent = 128 – 127 = 1

23 Ex: Convert – 3.5 to 32-bit FP Number
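
The worked steps are lost from the transcript; reconstructing them from the rules above:

    –3.5 = –1.75 × 2^1 = –(1.11)_2 × 2^1
    sign bit S = 1
    stored fraction = 1100…0 (23 bits; the leading 1 is implicit)
    biased exponent = 1 + 127 = 128 = (10000000)_2
    bit pattern: 1 10000000 11000000000000000000000 = 0xC0600000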

24 Examine Bits of FP Numbers
Explain how this program works
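
The program itself is not in the transcript; one common way to write such an examiner is to alias the float with an unsigned integer (a sketch, not necessarily the author's version; assumes a 32-bit unsigned int):

    #include <stdio.h>

    /* Alias the 32-bit float with an unsigned int to read its raw bits. */
    union Examiner {
        float f;
        unsigned int u;
    };

    int main(void)
    {
        union Examiner e;
        e.f = -3.5f;
        printf("%f = 0x%08X = ", e.f, e.u);
        for (int i = 31; i >= 0; --i)   /* sign, exponent, then fraction bits */
            putchar(((e.u >> i) & 1u) ? '1' : '0');
        putchar('\n');                  /* -3.500000 = 0xC0600000 = 110000000110... */
        return 0;
    }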

25 The "Examiner" Use the previous program to:
Observe how ME works
Test subnormal behavior on your computer/compiler
Convince yourself why the subtraction of two nearly equal numbers produces lots of error
NaN: Not-a-Number !?

26 Design Philosophy of IEEE 754
[s|e|m]
S first: whether the number is +/– can be tested easily
E before M: simplifies sorting
Exponent represented with a bias (not 2's complement) for ease of sorting:
[biased rep] –1, 0, 1 = 126, 127, 128
[2's compl.] –1, 0, 1 = 0xFF, 0x00, 0x01 — more complicated math for sorting and for increment/decrement

27 Exceptions
Overflow: ±INF, when a number exceeds the range of representation
Underflow: when numbers are too close to zero, they are treated as zeroes
Dwarf: the smallest representable number in the FP system
Machine Epsilon (ME): a number with computational significance (more later)

28 Extremities
E = (1…1), M = (0…0): infinity (more later)
E = (1…1), M not all zeros: NaN (Not a Number)
E = (0…0), M = (0…0): clean zero
E = (0…0), M not all zeros: dirty zero (see next page)

29 Not-a-Number Numerical exceptions:
sqrt of a negative number
invalid domain of trigonometric functions
These often cause the program to stop running.

30 Extremities (32-bit)
Max: (1.111…1)_2 × 2^127 ≈ 2^128 ≈ 3.4 × 10^38
Min (w/o stepping into dirty zero): (1.000…0)_2 × 2^(1-127) = 2^-126 ≈ 1.2 × 10^-38

31 Dirty-Zero (a.k.a. denormals)
No "implicit one"
IEEE 754 did not specify compatibility for denormals
If you are not sure how to handle them, stay away from them; scale your problem properly
"Many problems can be solved by pretending as if they do not exist"

32 Dirty-Zero (cont)
[Figure: the number line near zero; the denormals 2^-127, 2^-128, 2^-129, … fill the gap below the smallest normal number 2^-126, down to the dwarf, the smallest representable number.]

33 Dwarf (32-bit) Value: 2^-149 (the smallest fraction bit, 2^-23, at the smallest exponent scale, 2^-126)

34 Machine Epsilon (ME)
Definition: the smallest non-zero number that makes a difference when added to 1.0 on your working platform
This is not the same as the dwarf
Why 1.0?

35 Computing ME (32-bit) Start with eps = 1 and keep halving it: 1 + eps gets closer and closer to 1.0.
ME: (1 + 2^-23) – 1.0 = 2^-23 ≈ 1.19 × 10^-7
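
The slide's code is not in the transcript; a minimal sketch of the halving loop:

    #include <stdio.h>

    int main(void)
    {
        float eps = 1.0f;
        volatile float sum = 2.0f;   /* volatile keeps the test in single precision */
        while (sum > 1.0f) {
            eps /= 2.0f;
            sum = 1.0f + eps;
        }
        /* The last eps that still made a difference is eps * 2. */
        printf("ME = %e\n", eps * 2.0f);   /* 2^-23 ~ 1.192093e-07 for IEEE singles */
        return 0;
    }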

36 Effect of ME
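
The slide body is missing from the transcript; a plausible one-line demonstration of the effect:

    #include <stdio.h>

    int main(void)
    {
        float a = 1.0f + 1.0e-8f;    /* 1e-8 is below ME: the addition is lost */
        printf("%d\n", a == 1.0f);   /* prints 1 */
        return 0;
    }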

37 Significance of ME Never terminate an iteration by testing whether two FP numbers are equal. Instead, test whether |x – y| < ME.
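
A minimal sketch of that termination test (the hypothetical helper converged is not from the slides):

    #include <math.h>

    #define EPS 1.2e-7f   /* roughly the 32-bit machine epsilon */

    /* Bad:  while (x != y) ...  -- may loop forever.              */
    /* Good: stop when the iterates agree to within machine epsilon. */
    int converged(float x, float y)
    {
        return fabsf(x - y) < EPS;
    }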

38 Numerical Scaling Number density: there are as many IEEE 754 numbers in [1.0, 2.0] as there are in [256, 512]
Revisit: "roundoff" error
ME: a measure of the density near 1.0
Implication: scale your problem so that intermediate results lie between 1.0 and 2.0 (where numbers are dense, and where the roundoff error is smallest)

39 Scaling (cont) Performing computation on denser portions of the real line minimizes the roundoff error.
But don't overdo it; switching to double precision will easily increase the precision.
The densest part is near the subnormals, if density is defined as numbers per unit length.

40 How Subtraction is Performed on Your PC
Steps:
1. Convert both numbers to base 2.
2. Equalize the exponents by adjusting the mantissa values; truncate the bits that do not fit.
3. Subtract the mantissas.
4. Normalize the result.

41 Subtraction of Nearly Equal Numbers
Base 10 example: [operands lost in transcript]. Significant loss of accuracy (most bits are unreliable).
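
The slide's worked numbers are lost; a small C demonstration of the same cancellation, with hypothetical operands:

    #include <stdio.h>

    int main(void)
    {
        float a = 1.000001f;       /* rounds to 1 + 8 ULPs ~ 1.00000095 */
        float b = 1.000000f;
        /* The true difference is 1e-6, but most mantissa bits cancel: */
        printf("%.9f\n", a - b);   /* ~0.000000954; barely one digit is right */
        return 0;
    }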

42 Theorem of Loss of Precision
Let x, y be normalized floating-point machine numbers with x > y > 0.
If 2^-p ≤ 1 – y/x ≤ 2^-q, then at most p and at least q significant binary bits are lost in the subtraction x – y.
Interpretation: "When two numbers are very close, their subtraction introduces a lot of numerical error."

43 Implications Every FP operation introduces error, but the subtraction of nearly equal numbers is the worst and should be avoided whenever possible. When you program, rewrite such expressions algebraically; see the sketch below.
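
The slide's before/after code is lost from the transcript; a classic rewrite of this kind (the functions bad and good are hypothetical, not the author's example):

    #include <math.h>

    /* Naive: sqrt(x+1) and sqrt(x) are nearly equal for large x,
     * so the subtraction cancels most significant bits.          */
    double bad(double x)  { return sqrt(x + 1.0) - sqrt(x); }

    /* Algebraically equivalent form with no subtraction of close numbers. */
    double good(double x) { return 1.0 / (sqrt(x + 1.0) + sqrt(x)); }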

44 Efficiency Issues Horner scheme; program examples

45 Horner Scheme For polynomial evaluation; compare the efficiency.
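
The slide's program examples are lost; a sketch of the comparison, for c[0] + c[1]x + … + c[n]x^n:

    #include <math.h>

    /* Naive: recomputes each power of x with pow(). */
    double naive(const double c[], int n, double x)
    {
        double s = 0.0;
        for (int i = 0; i <= n; ++i)
            s += c[i] * pow(x, i);
        return s;
    }

    /* Horner scheme: only n multiplies and n adds, and usually more accurate. */
    double horner(const double c[], int n, double x)
    {
        double s = c[n];
        for (int i = n - 1; i >= 0; --i)
            s = s * x + c[i];
        return s;
    }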

46 Accuracy vs. Efficiency

47 Good Coding Practice

48 On Arrays …

49 Issues of PI 3.14 is often not accurate enough
4.0*atan(1.0) is a good substitute
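
A minimal sketch of that substitute in use:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double pi = 4.0 * atan(1.0);   /* pi to full machine precision */
        printf("%.17g\n", pi);         /* 3.1415926535897931 */
        return 0;
    }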

50 Compare:

51 Exercise Explain why [the expressions shown on the slide, lost in transcript] converge when implemented numerically.

52 Exercise Why does Me() not work as advertised?
Construct the 64-bit version of everything: bit-examiner, Dme()
32-bit: int and float — can every int be represented by a float (if converted)?

