
Published by Sonya Pascal. Modified over 2 years ago.

1
Robert Enenkel, Allan Martin
IBM® Toronto Lab
Speeding Up Floating-Point Division With In-lined Iterative Algorithms

2
© Copyright IBM Corp. 2005
Outline
- Hardware floating-point division
- The case for software division
- Software division algorithms
- Special cases/tradeoffs
- Performance results
- Automatic generation

3
Hardware Division
- PPC fdiv, fdivs
- Advantages
  - accurate (correctly rounded)
  - handles exceptional cases (Inf, NaN)
  - lower latency than SW
- Disadvantages
  - occupies FPU completely
  - inhibits parallelism

4
Alternatives to HW division
- Vector libraries
  - MASS
  - higher overhead, greater speedup
- In-lined software division
  - low overhead, medium speedup

5
Rationale for Software Division
- Write SW division algorithm in terms of HW arithmetic instructions
  - Newton's method or Taylor series
- Latency will be higher than HW division
- But... SW instructions can be interleaved, so throughput may be better
- Requires enough independent instructions to interleave
  - loop of divisions
  - other work

6
Newton's Method
- To find x such that f(x) = 0:
  - Initial guess x_0
  - x_{n+1} = x_n - f(x_n)/f'(x_n), n = 0, 1, 2, ...
- Provided x_0 is close enough
  - x_n converges to x
  - It converges quadratically: |x_{n+1} - x| < c|x_n - x|^2
  - Number of bits of accuracy doubles with each iteration
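The quadratic convergence claimed on this slide can be observed numerically. The following is a minimal sketch, not taken from the presentation: the helper name `newton` and the test problem x^2 - 2 = 0 are illustrative assumptions.

```python
def newton(f, fprime, x0, steps):
    """Return the Newton iterates [x_0, x_1, ..., x_steps]."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - f(x) / fprime(x))  # x_{n+1} = x_n - f(x_n)/f'(x_n)
    return xs


# Illustrative problem (an assumption, not from the slides): solve x^2 - 2 = 0.
iterates = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.5, 4)
errors = [abs(x - 2.0 ** 0.5) for x in iterates]
```

Each entry of `errors` is smaller than the square of the previous one until the error reaches the rounding level of double precision, which is the "bits double with each iteration" behaviour.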

7
Newton's Method

8
Newton Iteration for Division
- For 1/b, let f(x) = 1/x - b
- For a/b, use a*(1/b) or f(x) = a/x - b
- Algorithm for 1/b
  - x_0 ~ 1/b initial guess
  - e_0 = 1 - b*x_0
  - x_1 = x_0 + e_0*x_0
  - e_1 = e_0*e_0
  - x_2 = x_1 + e_1*x_1
  - etc...
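The slide states the recurrences without deriving them. A short derivation, assuming the definition e_n = 1 - b*x_n (so that b*x_n = 1 - e_n), shows where they come from:

```latex
\begin{align*}
f(x) &= \frac{1}{x} - b, \qquad f'(x) = -\frac{1}{x^2} \\
x_{n+1} &= x_n - \frac{f(x_n)}{f'(x_n)}
         = x_n + x_n^2\left(\frac{1}{x_n} - b\right)
         = x_n + x_n\,(1 - b\,x_n)
         = x_n + e_n x_n \\
e_{n+1} &= 1 - b\,x_{n+1}
         = 1 - b\,x_n\,(1 + e_n)
         = 1 - (1 - e_n)(1 + e_n)
         = e_n^2
\end{align*}
```

The update therefore needs only multiplies and adds, and the error term can be squared directly instead of being recomputed from x_{n+1}.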

9
How Many Iterations Needed?
- Power5 reciprocal estimate instructions
  - FRES (single precision), FRE (double precision)
  - |relative error| <= 2^(-8)
- Floating-point precision
  - single: 24 bits
  - double: 53 bits
- Newton iterations
  - error: 2^(-16), 2^(-32), 2^(-64), 2^(-128)
  - single: 2 iterations for 1 ulp
  - double: 3 iterations for 1 ulp
  - +1 iteration for correct rounding (0.5 ulps)
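The iteration counts above can be reproduced with a small sketch. This is an illustration under assumptions, not the compiler's code: the function name `newton_reciprocal` is hypothetical, and the seed is built in software with a relative error of about 2^(-8) as a stand-in for the FRE/FRES estimate. Plain Python floats are used, so there is no fused multiply-add and the result is only accurate to a few ulps.

```python
def newton_reciprocal(b, x0, iterations):
    """Refine x0 ~ 1/b with the x/e recurrences from the slides.

    Returns the refined value and the list of predicted error
    magnitudes |e_0|, |e_1|, ... (predicted by e_{n+1} = e_n^2,
    not measured against the true reciprocal).
    """
    e = 1.0 - b * x0           # e_0
    x = x0
    predicted = [abs(e)]
    for _ in range(iterations):
        x = x + e * x          # x_{n+1} = x_n + e_n * x_n
        e = e * e              # e_{n+1} = e_n^2
        predicted.append(abs(e))
    return x, predicted


b = 3.0
seed = (1.0 / b) * (1.0 + 2.0 ** -8)   # assumed stand-in for a hardware estimate
x, predicted = newton_reciprocal(b, seed, 3)
```

With this seed, `predicted[2]` (about 2^(-32)) is the first entry below the single-precision target of 2^(-24), and `predicted[3]` (about 2^(-64)) is the first below the double-precision target of 2^(-53), matching the 2 and 3 iterations quoted on the slide.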

10
Taylor Series for Reciprocal
- x_0 ~ 1/b initial guess
- e = 1 - b*x_0
- 1/b = x_0/(b*x_0) = x_0*(1/(1-e)) = x_0*(1 + e + e^2 + e^3 + e^4 + ...)
- Algorithm (6 terms)
  - e = 1 - b*x_0
  - t_1 = 0.5 + e*e
  - q_1 = x_0 + x_0*e
  - t_2 = 0.75 + t_1*t_1
  - t_3 = q_1*e
  - q_2 = x_0 + t_2*t_3
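It is not obvious that the six-step algorithm evaluates the series. Expanding it gives t_2 = 1 + e^2 + e^4 and t_3 = x_0*e*(1 + e), so q_2 = x_0*(1 + e + e^2 + ... + e^6). The sketch below checks this numerically; the function names and the software-built seed are assumptions made for illustration.

```python
def taylor_reciprocal(b, x0):
    """The six-step algorithm from the slide, transcribed directly."""
    e = 1.0 - b * x0
    t1 = 0.5 + e * e
    q1 = x0 + x0 * e
    t2 = 0.75 + t1 * t1        # equals 1 + e^2 + e^4
    t3 = q1 * e                # equals x0 * e * (1 + e)
    return x0 + t2 * t3


def series_reciprocal(b, x0, highest_power=6):
    """Direct evaluation of x0 * (1 + e + ... + e^highest_power)."""
    e = 1.0 - b * x0
    return x0 * sum(e ** k for k in range(highest_power + 1))


b = 3.0
seed = (1.0 / b) * (1.0 + 2.0 ** -8)   # assumed stand-in for a hardware estimate
q2 = taylor_reciprocal(b, seed)
direct = series_reciprocal(b, seed)
```

The factored form needs fewer dependent operations than summing the powers one by one, which is presumably why the slide writes it this way.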

11
Speed/Accuracy tradeoff
- IBM compilers have -qstrict/-qnostrict
- -qstrict: SW result should match HW division exactly
- -qnostrict: SW result may be slightly less accurate for speed

12
Exceptions
- Even when a/b is representable...
- 1/b may underflow
  - a ~ b ~ huge, a/b ~ 1, 1/b denormalized
  - Causes loss of accuracy
- 1/b may overflow
  - a, b denormalized, a/b ~ 1, 1/b = Inf
  - Causes SW algorithm to produce NaN
- Handle with tests in algorithm
  - Use HW divide for exceptional cases
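The overflow case can be reproduced in ordinary double precision. The sketch below is an assumption-laden illustration, not IBM's implementation: the function names are hypothetical, the seed is simply `1.0 / b` rather than a hardware estimate, and the "HW divide" fallback is Python's `/` operator. Note that `b == 0` raises `ZeroDivisionError` in Python, unlike IEEE hardware.

```python
import math
import sys


def reciprocal_refine(b, steps=3):
    """Newton refinement of 1/b using the x/e recurrences."""
    x = 1.0 / b                # assumed stand-in for a hardware estimate
    e = 1.0 - b * x
    for _ in range(steps):
        x = x + e * x
        e = e * e
    return x


def divide_unchecked(a, b):
    """a/b computed as a * (1/b) with no special-case tests."""
    return a * reciprocal_refine(b)


def divide_checked(a, b):
    """Fall back to ordinary division when 1/b overflows or is denormalized."""
    x0 = 1.0 / b
    if not math.isfinite(x0) or abs(x0) < sys.float_info.min:
        return a / b
    return a * reciprocal_refine(b)


tiny = 5e-324                  # smallest positive denormalized double
bad = divide_unchecked(tiny, tiny)
good = divide_checked(tiny, tiny)
```

Here 1/tiny overflows to Inf, the refinement computes Inf - Inf, and `bad` is NaN even though the true quotient is 1. The checked version detects the problem and returns 1.0.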

13
Algorithm variations
- User callable built-in functions
  - swdiv(a,b): double precision, checking
  - swdivs(a,b): single precision, checking
  - swdiv_nochk(a,b): double, non-checking
  - swdivs_nochk(a,b): single, non-checking
- Accuracy of swdiv, swdiv_nochk depends on -qstrict/-qnostrict
- _nochk versions faster but have argument restrictions

14
Accuracy and Performance
Columns: Power5 speedup ratio | Power4 speedup ratio | Power5 ulps max error | Power4 ulps max error
swdivs: 1.07 | 1.05 | 0.5
swdivs_nochk: 1.46 | 1.28 | 0.5
swdiv strict: 1.05 | 0.5
swdiv nostrict: 1.50 | 1.5
swdiv_nochk strict: 1.51 | 0.5
swdiv_nochk nostrict: 1.77 | 1.5

15
Automatic Generation of Software Division
- The swdivs and swdiv algorithms can also be automatically generated by the compiler
- Compiler can detect situations where throughput is more important than latency

16
Automatic Generation of Software Division
- In straight-line code, we use a heuristic that calculates how much FP can be executed in parallel
  - independent instructions are good, especially other divides
  - dependent instructions are bad (they increase latency)

17
Automatic Generation of Software Division
- In modulo scheduled loops, software-divide code can be pipelined, interleaving multiple iterations
- Divides are expanded if the divide does not appear in a recurrence (cyclic data-dependence)

18
Summary
- Software divide algorithms
  - user callable
  - compiler generated
- Loops of divides
  - up to 1.77x speedup
- UMT2K benchmark
  - 1.19x speedup

