# Speeding Up Floating-Point Division With In-lined Iterative Algorithms

Robert Enenkel, Allan Martin, IBM® Toronto Lab



© Copyright IBM Corp. 2005

## Outline

- Hardware floating-point division
- The case for software division
- Software division algorithms
- Special cases/tradeoffs
- Performance results
- Automatic generation

## Hardware Division

- PPC fdiv, fdivs
- Advantages
  - accurate (correctly rounded)
  - handles exceptional cases (Inf, NaN)
  - lower latency than SW
- Disadvantages
  - occupies FPU completely
  - inhibits parallelism

## Alternatives to HW division

- Vector libraries
  - MASS
  - higher overhead, greater speedup
- In-lined software division
  - low overhead, medium speedup

## Rationale for Software Division

- Write SW division algorithm in terms of HW arithmetic instructions
  - Newton's method or Taylor series
- Latency will be higher than HW division
- But... SW instructions can be interleaved, so throughput may be better
- Requires enough independent instructions to interleave
  - loop of divisions
  - other work

## Newton's Method

- To find $x$ such that $f(x) = 0$:
- Initial guess $x_0$
- $x_{n+1} = x_n - f(x_n)/f'(x_n)$, $n = 0, 1, 2, \ldots$
- Provided $x_0$ is close enough
  - $x_n$ converges to $x$
  - It converges quadratically: $|x_{n+1} - x| < c\,|x_n - x|^2$
  - Number of bits of accuracy doubles with each iteration
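
The quadratic convergence described above can be observed numerically. The following is a minimal sketch, not taken from the slides: the helper `newton` and the test problem $x^2 - 2 = 0$ are illustrative assumptions chosen only because the exact root is easy to compare against.

```python
import math

def newton(f, fprime, x0, steps):
    """Return the list of Newton iterates x0, x1, ..., x_steps."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - f(x) / fprime(x))
    return xs

# Illustrative problem (an assumption, not from the slides): solve x*x - 2 = 0.
iterates = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.5, 4)
errors = [abs(x - math.sqrt(2.0)) for x in iterates]

# Each error is roughly proportional to the square of the previous one.
for n, err in enumerate(errors):
    print(n, err)
```

Printing the errors shows the number of correct digits roughly doubling per step until double-precision rounding takes over.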

## Newton Iteration for Division

- For 1/b, let $f(x) = 1/x - b$
- For a/b, use $a \cdot (1/b)$ or $f(x) = a/x - b$
- Algorithm for 1/b
  - $x_0 \approx 1/b$ (initial guess)
  - $e_0 = 1 - b x_0$
  - $x_1 = x_0 + e_0 x_0$
  - $e_1 = e_0 e_0$
  - $x_2 = x_1 + e_1 x_1$
  - etc.
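
The update rule follows from applying Newton's method to $f(x) = 1/x - b$: since $f'(x) = -1/x^2$, the step is $x_{n+1} = x_n + x_n(1 - b x_n)$, and the new residual is $1 - b x_{n+1} = e_n^2$. A minimal Python sketch of the reciprocal iteration is given below. The function names are hypothetical, and `rough_reciprocal` is only a stand-in for a hardware estimate instruction: it cheats by calling ordinary division and then throwing away all but 8 significant bits. Python has no fused multiply-add before 3.13, so this sketch uses separate multiplies and adds and does not claim correctly rounded results.

```python
import math

def rough_reciprocal(b, bits=8):
    """Stand-in for a hardware estimate: 1/b rounded to `bits` significant bits."""
    mantissa, exponent = math.frexp(1.0 / b)
    scale = 1 << bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)

def newton_reciprocal(b, iterations=3):
    """Refine an estimate of 1/b using x <- x + e*x and e <- e*e."""
    x = rough_reciprocal(b)
    e = 1.0 - b * x            # e0 = 1 - b*x0
    for _ in range(iterations):
        x = x + e * x          # x_{n+1} = x_n + e_n * x_n
        e = e * e              # e_{n+1} = e_n^2
    return x

print(rough_reciprocal(3.0), newton_reciprocal(3.0))
```

Note that the error term is squared directly rather than recomputed from $b$ and the new $x$, which keeps the dependency chain short.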

## How Many Iterations Needed?

- Power5 reciprocal estimate instructions
  - FRES (single precision), FRE (double precision)
  - $|\text{relative error}| \le 2^{-8}$
- Floating-point precision
  - single: 24 bits
  - double: 53 bits
- Newton iterations
  - error: $2^{-16}$, $2^{-32}$, $2^{-64}$, $2^{-128}$
  - single: 2 iterations for 1 ulp
  - double: 3 iterations for 1 ulp
  - +1 iteration for correct rounding (0.5 ulps)
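
The iteration counts above follow from doubling the 8 accurate bits of the estimate until the target precision is covered. The sketch below is an idealized count under that assumption; the function name is hypothetical, and rounding error inside each step (the reason for the extra iteration mentioned above) is deliberately ignored.

```python
def newton_iterations_needed(target_bits, estimate_bits=8):
    """Count doublings of accurate bits needed to reach target_bits."""
    iterations = 0
    bits = estimate_bits
    while bits < target_bits:
        bits *= 2              # quadratic convergence doubles the accurate bits
        iterations += 1
    return iterations

print(newton_iterations_needed(24))  # single precision
print(newton_iterations_needed(53))  # double precision
```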

## Taylor Series for Reciprocal

- $x_0 \approx 1/b$ (initial guess)
- $e = 1 - b x_0$
- $1/b = x_0/(b x_0) = x_0 \cdot \frac{1}{1-e} = x_0 (1 + e + e^2 + e^3 + e^4 + \cdots)$
- Algorithm (6 terms)
  - $e = 1 - b x_0$
  - $t_1 = 0.5 + e \cdot e$
  - $q_1 = x_0 + x_0 \cdot e$
  - $t_2 = 0.75 + t_1 \cdot t_1$
  - $t_3 = q_1 \cdot e$
  - $q_2 = x_0 + t_2 \cdot t_3$
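
It is not obvious at a glance that the constants 0.5 and 0.75 reproduce the series. Expanding gives $t_2 = 0.75 + (0.5 + e^2)^2 = 1 + e^2 + e^4$ and $t_3 = x_0 (e + e^2)$, so $q_2 = x_0 (1 + e + e^2 + \cdots + e^6)$. The sketch below checks this identity exactly with rational arithmetic. The function name and the sample values of `b` and `x0` are illustrative assumptions.

```python
from fractions import Fraction

def taylor_reciprocal(b, x0):
    """Evaluate the six-step scheme; works for floats or exact Fractions."""
    e = 1 - b * x0
    t1 = Fraction(1, 2) + e * e
    q1 = x0 + x0 * e
    t2 = Fraction(3, 4) + t1 * t1
    t3 = q1 * e
    return x0 + t2 * t3

# Sample inputs (assumptions): b = 3 and an estimate of 1/3 good to about 8 bits.
b = Fraction(3)
x0 = Fraction(171, 512)
e = 1 - b * x0

# Truncated series x0 * (1 + e + ... + e^6), computed exactly.
series = x0 * sum(e ** k for k in range(7))
print(taylor_reciprocal(b, x0) == series)
```

With the series truncated after $e^6$, the remaining relative error is $e^7$, so an 8-bit estimate leaves about $2^{-56}$ before rounding effects.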

## Speed/Accuracy tradeoff

- IBM compilers have -qstrict/-qnostrict
- -qstrict: SW result should match HW division exactly
- -qnostrict: SW result may be slightly less accurate for speed

## Exceptions

- Even when a/b is representable...
- 1/b may underflow
  - a ~ b ~ huge, a/b ~ 1, 1/b denormalized
  - Causes loss of accuracy
- 1/b may overflow
  - a, b denormalized, a/b ~ 1, 1/b = Inf
  - Causes SW algorithm to produce NaN
- Handle with tests in algorithm
  - Use HW divide for exceptional cases
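
Both failure modes can be reproduced in ordinary double-precision arithmetic. The sketch below is an illustration under stated assumptions, not the checking code used by the compiler: `naive_swdiv` and `checked_swdiv` are hypothetical names, plain division stands in for both the reciprocal estimate and the hardware divide, and the range test is one simple possibility among many. A zero divisor is out of scope here because Python raises an exception for it.

```python
import math
import sys

def naive_swdiv(a, b):
    """a/b as a*(1/b) with one Newton refinement and no range checks."""
    x = 1.0 / b                # stand-in for the reciprocal estimate
    e = 1.0 - b * x
    x = x + e * x
    return a * x

def checked_swdiv(a, b):
    """Fall back to ordinary division when 1/b is not a normal finite number."""
    x = 1.0 / b
    if not (sys.float_info.min <= abs(x) < math.inf):
        return a / b           # stand-in for the hardware divide
    e = 1.0 - b * x
    x = x + e * x
    return a * x

tiny = 1e-320                  # denormalized: 1/tiny overflows to Inf
huge = 1.7e308                 # 1/huge is denormalized

print(naive_swdiv(tiny, tiny))     # Inf - Inf inside the refinement gives NaN
print(1.0 / huge < sys.float_info.min)
print(checked_swdiv(tiny, tiny), checked_swdiv(huge, huge))
```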

## Algorithm variations

- User callable built-in functions
  - swdiv(a,b): double precision, checking
  - swdivs(a,b): single precision, checking
  - swdiv_nochk(a,b): double precision, non-checking
  - swdivs_nochk(a,b): single precision, non-checking
- Accuracy of swdiv, swdiv_nochk depends on -qstrict/-qnostrict
- _nochk versions are faster but have argument restrictions

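## Accuracy and Performance

Column headings on the slide: Power5 speedup ratio, Power4 speedup ratio, Power5 ulps max error, Power4 ulps max error. Where a row shows two speedup ratios, they appear in that order.

| Function | Speedup ratio | Max error (ulps) |
|---|---|---|
| swdivs | 1.07, 1.05 | 0.5 |
| swdivs_nochk | 1.46, 1.28 | 0.5 |
| swdiv strict | 1.05 | 0.5 |
| swdiv nostrict | 1.50 | 1.5 |
| swdiv_nochk strict | 1.51 | 0.5 |
| swdiv_nochk nostrict | 1.77 | 1.5 |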

## Automatic Generation of Software Division

- The swdivs and swdiv algorithms can also be automatically generated by the compiler
- Compiler can detect situations where throughput is more important than latency

## Automatic Generation of Software Division (continued)

- In straight-line code, we use a heuristic that calculates how much FP can be executed in parallel
  - independent instructions are good, especially other divides
  - dependent instructions are bad (they increase latency)

## Automatic Generation of Software Division (continued)

- In modulo scheduled loops, software-divide code can be pipelined, interleaving multiple iterations
- Divides are expanded if the divide does not appear in a recurrence (cyclic data dependence)
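
The distinction between a divide that can be expanded and one caught in a recurrence is easiest to see in two loop shapes. The following sketch only illustrates the data dependences; the function names are hypothetical, and Python itself performs no such scheduling.

```python
def independent_divides(a, b):
    """Each quotient depends only on its own inputs, so iterations can overlap."""
    return [x / y for x, y in zip(a, b)]

def recurrent_divides(a, start):
    """Each divide needs the previous quotient: a cyclic data dependence."""
    x = start
    for value in a:
        x = value / x          # cannot begin until the previous divide finishes
    return x

print(independent_divides([1.0, 9.0], [2.0, 3.0]))
print(recurrent_divides([2.0, 8.0], 1.0))
```

In the first loop, the longer latency of a software divide can be hidden by interleaving iterations. In the second, every divide sits on the critical path, so the lower-latency hardware divide is the better fit.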

## Summary

- Software divide algorithms
  - user callable
  - compiler generated
- Loops of divides
  - up to 1.77x speedup
- UMT2K benchmark
  - 1.19x speedup
