
1 A gentle introduction to floating point arithmetic Ho Chun Hok (cho@doc.ic.ac.uk) Custom Computing Group Seminar 25 Nov 2005

2 IEEE 754: floating point standard
Normal numbers (when 0 < exponent < max exponent)
–v = (-1)^s x 2^(exponent - bias) x (1.fraction)
Subnormal numbers (when exponent = 0)
–v = (-1)^s x 2^(1 - bias) x (0.fraction)
Special numbers (when exponent = max exponent)
–Infinity, NaN (not a number)
Precisions
–Single: esize = 8, fsize = 23, vsize = 32
–Double: esize = 11, fsize = 52, vsize = 64
–Double extended: vsize > 64
Operations:
–+, -, x, /, sqrt, f2i, i2f, compare, f2d, d2f
Rounding
–Nearest even, +inf, -inf, towards 0
Bit layout: | s (1 bit) | exponent (esize bits) | fraction (fsize bits) |, vsize bits in total (a small decoder is sketched below)
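
A minimal C sketch (not from the original slides) of pulling these three fields apart and classifying the value; the helper name classify_float and the sample inputs are illustrative only, and the constants assume the single-precision layout above (bias 127, smallest normal exponent -126).

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Decode the fields of a single-precision float and report which
       IEEE 754 category (normal / subnormal / special) it falls into. */
    static void classify_float(float f)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);          /* reinterpret the bits safely */
        unsigned sign     = bits >> 31;
        unsigned exponent = (bits >> 23) & 0xFF; /* esize = 8  */
        unsigned fraction = bits & 0x7FFFFF;     /* fsize = 23 */

        if (exponent == 0xFF)
            printf("%g: special (%s)\n", f, fraction ? "NaN" : "infinity");
        else if (exponent == 0)
            printf("%g: subnormal, v = (-1)^%u x 2^-126 x 0.fraction\n", f, sign);
        else
            printf("%g: normal, v = (-1)^%u x 2^%d x 1.fraction\n",
                   f, sign, (int)exponent - 127);
    }

    int main(void)
    {
        classify_float(1.0f);
        classify_float(1e-40f);      /* subnormal in single precision */
        classify_float(1.0f / 0.0f); /* +infinity */
        return 0;
    }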

3 What the IEEE 754 standard is supposed to provide
Approximation to real numbers with a known error
–the error (epsilon) of mapping any real number to a floating point number can be determined
Results of all operations are correctly rounded when the exact result is not representable
Some math properties hold (in general): x+y == y+x, -(-a) == a, a>=b & c>=0 => a*c >= b*c, x+0 == x, y*y >= 0
All exceptions can be detected
–using exception flags (see the sketch below)
Same results across different machines
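
For the exception flags, C99's <fenv.h> gives portable access; a minimal sketch, where the particular operations chosen to raise the flags are just illustrations (some compilers ignore the FENV_ACCESS pragma, and linking with -lm may be needed):

    #include <stdio.h>
    #include <fenv.h>
    #include <float.h>

    #pragma STDC FENV_ACCESS ON   /* tell the compiler we inspect the flags */

    int main(void)
    {
        feclearexcept(FE_ALL_EXCEPT);

        volatile double big = DBL_MAX, one = 1.0, three = 3.0;
        volatile double r1 = big * 2.0;   /* overflows to +inf: sets FE_OVERFLOW */
        volatile double r2 = one / three; /* not exactly representable: sets FE_INEXACT */

        if (fetestexcept(FE_OVERFLOW)) printf("overflow raised (r1 = %g)\n", r1);
        if (fetestexcept(FE_INEXACT))  printf("inexact raised (r2 = %g)\n", r2);
        return 0;
    }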

4 How to ensure compliance with the standard?
Processor?
–Rounding numbers in the different modes
–Gradual underflow
–Raising exceptions
Operating System?
–Handling exceptions
–Handling operations that may not be supported in hardware (what if a processor cannot handle subnormal numbers?)
–Keeping track of the floating point unit state (precision, rounding mode; see the sketch below)
Programming Language?
–Well-defined semantics for floating point (yes, we have the infamous Java language)
Compiler?
–Preserving the semantics defined in the language
Programmer?
–read: What Every Computer Scientist Should Know About Floating-Point Arithmetic
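
One concrete piece of that FPU state is the rounding mode. A sketch (not from the slides) of switching between the four IEEE modes with C99's <fenv.h>, assuming a platform that defines all four mode macros:

    #include <stdio.h>
    #include <fenv.h>

    /* Evaluate the same inexact operation under each IEEE rounding mode. */
    int main(void)
    {
        const int   modes[] = { FE_TONEAREST, FE_UPWARD, FE_DOWNWARD, FE_TOWARDZERO };
        const char *names[] = { "nearest even", "+inf", "-inf", "towards 0" };
        volatile double one = 1.0, three = 3.0;

        for (int i = 0; i < 4; i++) {
            fesetround(modes[i]);
            printf("1/3 rounded to %-12s = %.20f\n", names[i], one / three);
        }
        fesetround(FE_TONEAREST);   /* restore the default mode */
        return 0;
    }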

5 Case study 1
#include <stdio.h>

int main(void)
{
    double ref, index;
    int i;

    /* ref = 169/170; mathematically, i/(i+1) equals this only at i = 169 */
    ref = (double) 169.0 / (double) 170.0;
    for (i = 0; i < 250; i++) {
        index = i;
        if (ref == (double) (index / (index + 1.0)))
            break;
    }
    printf("i=%d\n", i);
    return 0;
}

6 Visual C compiler, running on a Pentium M. Same result on lulu (Pentium 3) and irina (Xeon).

7 GCC, running on Pentium 4 (skokie)

8 gcc:
    fld1
    faddp %st,%st(1)
    fldl 0xfffffff0(%ebp)
    fdivp %st,%st(1)
    fldl 0xfffffff8(%ebp)
    fxch %st(1)
    fucompp
    fnstsw %ax
    and $0x45,%ah
    cmp $0x40,%ah
    je 0x80483d2

VCC:
    fld qword ptr [ebp-10h]
    fadd qword ptr [__real@8@3fff8000000000000000 (00426028)]
    fdivr qword ptr [ebp-10h]
    fcomp qword ptr [ebp-8]
    fnstsw ax
    test ah,40h
    je main+5Fh (0040106f)
    jmp main+61h (00401071)

VCC uses the normal stack (ebp, 64 bits) to store the result and compares it with a 64-bit double precision value. GCC keeps the result on the FPU register stack (st, 80 bits) and compares it with a 64-bit double precision value.

9 Case study 1
It's a compiler issue
–Using more precision to calculate the intermediate result is a good idea
–But the compiler should convert the 80-bit floating point number to 64 bits before the comparison
And it's a programmer issue too
–Equality tests between FP variables are dangerous (see the sketch below)
–We can detect the problem before it hurts...
It is not easy to comply with the standard
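
One defensive habit this case suggests: compare with a tolerance instead of ==. A minimal sketch (not from the slides); the helper name nearly_equal and the tolerance values are illustrative only:

    #include <stdio.h>
    #include <math.h>

    /* Treat two doubles as equal when they differ by at most a small
       relative amount, with an absolute floor for values near zero. */
    static int nearly_equal(double a, double b, double rel_tol, double abs_tol)
    {
        double diff  = fabs(a - b);
        double scale = fmax(fabs(a), fabs(b));
        return diff <= fmax(rel_tol * scale, abs_tol);
    }

    int main(void)
    {
        double ref = 169.0 / 170.0;
        volatile double index = 169.0;
        double x = index / (index + 1.0);

        printf("exact ==    : %d\n", ref == x); /* may fail if an intermediate stays in an 80-bit register */
        printf("nearly_equal: %d\n", nearly_equal(ref, x, 1e-12, 1e-15));
        return 0;
    }

On x86, GCC options such as -ffloat-store or -mfpmath=sse also force intermediates down to 64 bits, which sidesteps the 80-bit register issue at some performance cost.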

10 Case study 2
Calculate an expression whose exact value is tiny but nonzero: when x is large, the computed result == 0 rather than the small nonzero value
Beware: even when everything complies with the standard, the standard cannot guarantee the result is always correct
Again, the programmer should detect this before it hurts
–Define a routine to trap the exception (sketched below)
–Exceptions are not errors as long as they are handled correctly
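
A sketch of "detect it before it hurts" for underflow, again with <fenv.h>. The slide's original formula is not in the transcript, so exp(-x) is only a stand-in for whatever expression underflows, and whether the math library raises FE_UNDERFLOW for exp is implementation-specific:

    #include <stdio.h>
    #include <math.h>
    #include <fenv.h>

    int main(void)
    {
        volatile double x = 1000.0;   /* large x */
        feclearexcept(FE_ALL_EXCEPT);

        double r = exp(-x);           /* exact value ~5e-435 is far below the smallest double, so r becomes 0 */

        if (fetestexcept(FE_UNDERFLOW))
            printf("underflow detected: exp(-%g) flushed to %g\n", (double)x, r);
        return 0;
    }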

11 Case Study 3
Jean-Michel Muller's recurrence: x(n+1) = 111 - 1130/x(n) + 3000/(x(n) * x(n-1))
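
A sketch (not from the slides) of the recurrence in plain double precision; the starting values match those on the later slides, and the loop bound simply mirrors the trace on the next slide.

    #include <stdio.h>

    /* Muller's recurrence: x[n+1] = 111 - 1130/x[n] + 3000/(x[n]*x[n-1]).
       With x0 = 11/2 and x1 = 61/11 the exact sequence converges to 6,
       but rounding errors push the computed one towards 100. */
    int main(void)
    {
        double xprev = 11.0 / 2.0;   /* x0 */
        double xcur  = 61.0 / 11.0;  /* x1 */

        for (int n = 2; n <= 19; n++) {
            double xnext = 111.0 - 1130.0 / xcur + 3000.0 / (xcur * xprev);
            printf("x[%d] = %e\n", n, xnext);
            xprev = xcur;
            xcur  = xnext;
        }
        return 0;
    }

Run in double, the sequence should escape towards 100 a few iterations earlier than the 80-bit trace on the next slide; switching double to long double should roughly reproduce that trace.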

12 Using double extended precision (80 bits)
x[2] = 5.590164e+00
x[3] = 5.633431e+00
x[4] = 5.674649e+00
x[5] = 5.713329e+00
x[6] = 5.749121e+00
x[7] = 5.781811e+00
x[8] = 5.811314e+00
x[9] = 5.837660e+00
x[10] = 5.861018e+00
x[11] = 5.882514e+00
x[12] = 5.918471e+00
x[13] = 6.240859e+00
x[14] = 1.115599e+01
x[15] = 5.279838e+01
x[16] = 9.469105e+01
x[17] = 9.966651e+01
x[18] = 9.998007e+01
x[19] = 9.999881e+01
It converges to 100.0, which seems correct

13 Case Study 3
This series can converge to 5, 6, or 100
–Depending on the values of x0 and x1
–If x0 = 11/2 and x1 = 61/11, the series should converge to 6
–A little round-off error may affect the result dramatically: the exact sequence has no 100^n component, but any rounding error introduces one, and it quickly dominates
–We can calculate the result analytically by substituting the general solution x(n) = (a*5^(n+1) + b*6^(n+1) + c*100^(n+1)) / (a*5^n + b*6^n + c*100^n)
In general, it's very difficult to detect this error

14 Case Study 4
Table maker's dilemma
–If we want n-digit accuracy for an elementary function like sine or cosine, we (in most cases) need to calculate the result to n+2 digits
–What if the last 2 digits are "10"? Then we calculate to n+3 digits
–What if the last 3 digits are "100"? Then we calculate to n+4 digits
–What if...
The results of most elementary function libraries (e.g. libm) are not correctly rounded in some cases (a rough check is sketched below)
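
A rough way to see this in practice (not from the slides) is to compare libm's double result with the same function evaluated in extended precision and then rounded back. This is only a heuristic, since the long double result is itself inexact, and on platforms where long double is the same as double (e.g. MSVC) the check tells you nothing:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        int mismatches = 0;

        for (int i = 1; i <= 100000; i++) {
            double x    = i * 1e-3;
            double fast = sin(x);                         /* double-precision libm result      */
            double slow = (double) sinl((long double) x); /* extended precision, then rounded  */
            if (fast != slow)
                mismatches++;
        }
        printf("results differing in the last bit: %d / 100000\n", mismatches);
        return 0;
    }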

15 Conclusion
We have a standard representation for floating point numbers
Conforming to the standard requires collaboration between different parties
Even on a standard-compliant platform, caution must be taken with underflow and overflow, and the algorithm must be numerically stable
When using elementary functions, don't expect the results to be comparable between different machines
–Elementary functions are NOT included in the IEEE standard
Floating point, when used properly, can do something serious

