Robert Enenkel, Allan Martin, IBM® Toronto Lab. Speeding Up Floating-Point Division With In-lined Iterative Algorithms.


1 Robert Enenkel, Allan Martin, IBM® Toronto Lab. Speeding Up Floating-Point Division With In-lined Iterative Algorithms

2 © Copyright IBM Corp Outline
- Hardware floating-point division
- The case for software division
- Software division algorithms
- Special cases/tradeoffs
- Performance results
- Automatic generation

3 Hardware Division
- PPC fdiv, fdivs
- Advantages
  - accurate (correctly rounded)
  - handles exceptional cases (Inf, NaN)
  - lower latency than SW
- Disadvantages
  - occupies FPU completely
  - inhibits parallelism

4 Alternatives to HW division
- Vector libraries
  - MASS
  - higher overhead, greater speedup
- In-lined software division
  - low overhead, medium speedup

5 Rationale for Software Division
- Write SW division algorithm in terms of HW arithmetic instructions
  - Newton's method or Taylor series
- Latency will be higher than HW division
- But... SW instructions can be interleaved, so throughput may be better
- Requires enough independent instructions to interleave
  - loop of divisions
  - other work

6 Newton's Method
- To find x such that f(x) = 0:
- Initial guess x_0
- x_{n+1} = x_n - f(x_n)/f'(x_n), n = 0, 1, 2, ...
- Provided x_0 is close enough
  - x_n converges to x
  - It converges quadratically: |x_{n+1} - x| < c|x_n - x|^2
  - Number of bits of accuracy doubles with each iteration
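The quadratic convergence described on this slide can be observed numerically. The following is a minimal sketch, not taken from the presentation: the function name `newton` and the test problem f(x) = x^2 - 2 are illustrative assumptions.

```python
import math

def newton(f, fprime, x0, steps):
    """Return the iterates x_0, x_1, ..., x_steps of Newton's method."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - f(x) / fprime(x))  # x_{n+1} = x_n - f(x_n)/f'(x_n)
    return xs

# Illustrative problem (an assumption, not from the slides):
# the positive root of f(x) = x^2 - 2 is sqrt(2).
iterates = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.5, 4)
errors = [abs(x - math.sqrt(2.0)) for x in iterates]

# Each error is roughly proportional to the square of the previous one,
# so the number of correct digits about doubles per step.
for n, err in enumerate(errors):
    print(n, err)
```

Once the error reaches the rounding level of double precision it stops shrinking, which is why the last printed values flatten out.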

7 Newton's Method

8 Newton Iteration for Division
- For 1/b, let f(x) = 1/x - b
- For a/b, use a*(1/b) or f(x) = a/x - b
- Algorithm for 1/b
  - x_0 ~ 1/b initial guess
  - e_0 = 1 - b*x_0
  - x_1 = x_0 + e_0*x_0
  - e_1 = e_0*e_0
  - x_2 = x_1 + e_1*x_1
  - etc.
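The recurrence above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: `estimate_recip` is a hypothetical stand-in for a hardware reciprocal-estimate instruction (it cheats by rounding the true reciprocal to 8 significant bits), and `recip_newton` and `soft_divide` are made-up names. Plain Python arithmetic rounds after every multiply and add, so the result is close to, but not guaranteed to be, correctly rounded.

```python
import math

def estimate_recip(b, bits=8):
    """Hypothetical stand-in for a hardware estimate: 1/b rounded to
    `bits` significant bits, so |relative error| <= 2**-bits."""
    m, exp = math.frexp(1.0 / b)           # 1/b = m * 2**exp, 0.5 <= |m| < 1
    scale = 1 << bits
    return math.ldexp(round(m * scale) / scale, exp)

def recip_newton(b, iterations=3):
    """Approximate 1/b with the recurrence from the slide."""
    x = estimate_recip(b)                  # x_0 ~ 1/b
    e = 1.0 - b * x                        # e_0 = 1 - b*x_0
    for _ in range(iterations):
        x = x + e * x                      # x_{n+1} = x_n + e_n*x_n
        e = e * e                          # e_{n+1} = e_n*e_n
    return x

def soft_divide(a, b, iterations=3):
    """a/b computed as a*(1/b); no handling of exceptional inputs."""
    return a * recip_newton(b, iterations)

print(recip_newton(3.0))
print(soft_divide(355.0, 113.0))
```

Squaring the error term instead of recomputing 1 - b*x_n keeps every step to one multiply and one multiply-add, at the cost of not correcting rounding error accumulated in x_n.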

9 How Many Iterations Needed?
- Power5 reciprocal estimate instructions
  - FRES (single precision), FRE (double precision)
  - |relative error| <= 2^(-8)
- Floating-point precision
  - single: 24 bits
  - double: 53 bits
- Newton iterations
  - error: 2^(-16), 2^(-32), 2^(-64), 2^(-128)
  - single: 2 iterations for 1 ulp
  - double: 3 iterations for 1 ulp
  - +1 iteration for correct rounding (0.5 ulps)
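The iteration counts on this slide follow from doubling the number of accurate bits at each step. A small sketch of that arithmetic, assuming ideal quadratic convergence and ignoring rounding error (the function name `iterations_needed` is hypothetical):

```python
def iterations_needed(target_bits, estimate_bits=8):
    """Count Newton steps until the accurate bits reach target_bits,
    assuming the bit count doubles on every step."""
    steps, bits = 0, estimate_bits
    while bits < target_bits:
        bits *= 2          # 8 -> 16 -> 32 -> 64 -> ...
        steps += 1
    return steps

print(iterations_needed(24))   # single precision: 2
print(iterations_needed(53))   # double precision: 3
```

The extra iteration for correct rounding mentioned on the slide is not modeled here.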

10 Taylor Series for Reciprocal
- x_0 ~ 1/b initial guess
- e = 1 - b*x_0
- 1/b = x_0/(b*x_0) = x_0*(1/(1 - e)) = x_0*(1 + e + e^2 + e^3 + e^4 + ...)
- Algorithm (6 terms)
  - e = 1 - b*x_0
  - t_1 = e*e
  - q_1 = x_0 + x_0*e
  - t_2 = t_1*t_1
  - t_3 = q_1*e
  - q_2 = x_0 + t_2*t_3
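The series identity on this slide can be checked with a short sketch. It sums the truncated series with Horner's rule, which is an assumption made for clarity and not the operation schedule listed on the slide; `recip_taylor` is a hypothetical name, and the initial guess is passed in by the caller.

```python
def recip_taylor(b, x0, terms=6):
    """Approximate 1/b as x0*(1 + e + e^2 + ... + e^(terms-1)),
    where e = 1 - b*x0 and x0 is an initial guess for 1/b."""
    e = 1.0 - b * x0
    s = 1.0
    for _ in range(terms - 1):
        s = 1.0 + e * s        # Horner's rule: 1 + e*(1 + e*(...))
    return x0 * s

# Illustrative input: x0 = 85/256 approximates 1/3 with e = 2**-8.
x0 = 0.33203125
for terms in (2, 4, 6, 8):
    approx = recip_taylor(3.0, x0, terms)
    print(terms, abs(3.0 * approx - 1.0))
```

Truncating after n terms leaves a relative error of e^n, so with an 8-bit estimate six terms give roughly 48 accurate bits. Unlike the Newton recurrence, the terms of the series do not all depend on one another, which leaves more room for instruction-level parallelism.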

11 Speed/Accuracy tradeoff
- IBM compilers have -qstrict/-qnostrict
- -qstrict: SW result should match HW division exactly
- -qnostrict: SW result may be slightly less accurate for speed

12 Exceptions
- Even when a/b is representable...
- 1/b may underflow
  - a ~ b ~ huge, a/b ~ 1, 1/b denormalized
  - Causes loss of accuracy
- 1/b may overflow
  - a, b denormalized, a/b ~ 1, 1/b = Inf
  - Causes SW algorithm to produce NaN
- Handle with tests in algorithm
  - Use HW divide for exceptional cases
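The overflow case can be reproduced in ordinary double-precision arithmetic. The sketch below is illustrative only: `soft_div_nochk` and `soft_div_checked` are hypothetical names that model the idea rather than the IBM built-ins, the exact reciprocal stands in for a hardware estimate, and the range bounds `lo` and `hi` are arbitrary assumptions.

```python
import math

def soft_div_nochk(a, b):
    """One Newton step on 1/b, then multiply by a; no range checks."""
    x = 1.0 / b              # stand-in for a reciprocal estimate
    e = 1.0 - b * x          # becomes -inf when x overflowed to inf
    x = x + e * x            # inf + (-inf) yields NaN
    return a * x

def soft_div_checked(a, b, lo=2.0 ** -1000, hi=2.0 ** 1000):
    """Fall back to ordinary division when |b| lies outside [lo, hi],
    where 1/b could overflow or become denormalized (NaN also fails
    the comparison and takes the fallback)."""
    if not (lo <= abs(b) <= hi):
        return a / b
    return soft_div_nochk(a, b)

tiny = 5e-324                # smallest positive denormalized double
print(tiny / tiny)                      # 1.0
print(soft_div_nochk(tiny, tiny))       # nan
print(soft_div_checked(tiny, tiny))     # 1.0
```

The check costs a comparison and a branch on every divide, which is the overhead the non-checking variants avoid by restricting their arguments.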

13 Algorithm variations
- User callable built-in functions
  - swdiv(a,b): double precision, checking
  - swdivs(a,b): single precision, checking
  - swdiv_nochk(a,b): double, non-checking
  - swdivs_nochk(a,b): single, non-checking
- Accuracy of swdiv, swdiv_nochk depends on -qstrict/-qnostrict
- _nochk versions faster but have argument restrictions

14 Accuracy and Performance
[Table; cell values not preserved in the transcript]
Columns: Power5 speedup ratio | Power4 speedup ratio | Power5 max error (ulps) | Power4 max error (ulps)
Rows: swdivs | swdivs_nochk | swdiv (strict) | swdiv (nostrict) | swdiv_nochk (strict) | swdiv_nochk (nostrict)

15 Automatic Generation of Software Division
- The swdivs and swdiv algorithms can also be automatically generated by the compiler
- Compiler can detect situations where throughput is more important than latency

16 Automatic Generation of Software Division
- In straight-line code, we use a heuristic that calculates how much FP can be executed in parallel
  - independent instructions are good, especially other divides
  - dependent instructions are bad (they increase latency)

17 Automatic Generation of Software Division
- In modulo scheduled loops, software-divide code can be pipelined, interleaving multiple iterations
- Divides are expanded if the divide does not appear in a recurrence (cyclic data dependence)

18 Summary
- Software divide algorithms
  - user callable
  - compiler generated
- Loops of divides
  - up to 1.77x speedup
- UMT2K benchmark
  - 1.19x speedup

