Published by Sonya Pascal
Modified about 1 year ago
Speeding Up Floating-Point Division With In-lined Iterative Algorithms
Robert Enenkel, Allan Martin
IBM® Toronto Lab
© Copyright IBM Corp

Outline
- Hardware floating-point division
- The case for software division
- Software division algorithms
- Special cases/tradeoffs
- Performance results
- Automatic generation
Hardware Division
- PPC fdiv, fdivs
- Advantages
  - accurate (correctly rounded)
  - handles exceptional cases (Inf, NaN)
  - lower latency than SW
- Disadvantages
  - occupies FPU completely
  - inhibits parallelism
Alternatives to HW division
- Vector libraries
  - MASS
  - higher overhead, greater speedup
- In-lined software division
  - low overhead, medium speedup
Rationale for Software Division
- Write SW division algorithm in terms of HW arithmetic instructions
  - Newton's method or Taylor series
- Latency will be higher than HW division
- But... SW instructions can be interleaved, so throughput may be better
- Requires enough independent instructions to interleave
  - loop of divisions
  - other work
Newton's Method
- To find $x$ such that $f(x) = 0$:
  - Initial guess $x_0$
  - $x_{n+1} = x_n - f(x_n)/f'(x_n)$, $n = 0, 1, 2, \ldots$
- Provided $x_0$ is close enough
  - $x_n$ converges to $x$
  - It converges quadratically: $|x_{n+1} - x| < c\,|x_n - x|^2$
  - Number of bits of accuracy doubles with each iteration
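The quadratic convergence claim can be checked numerically. The following is a minimal sketch, not taken from the slides: the helper name `newton` and the test function $f(x) = x^2 - 2$ are assumptions chosen only for illustration.

```python
def newton(f, fprime, x0, steps):
    """Return the list of Newton iterates x_0, x_1, ..., x_steps."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - f(x) / fprime(x))  # x_{n+1} = x_n - f(x_n)/f'(x_n)
    return xs


# Illustrative problem: the positive root of f(x) = x^2 - 2, starting at 1.5.
iterates = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.5, 4)
errors = [abs(x - 2.0 ** 0.5) for x in iterates]

# Each error is roughly a constant times the square of the previous one,
# so the number of correct digits about doubles per step.
for n, err in enumerate(errors):
    print(n, err)
```

Printing the errors shows each one shrinking to roughly the square of the previous one until the limit of double precision is reached.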
Newton Iteration for Division
- For $1/b$, let $f(x) = 1/x - b$
- For $a/b$, use $a \cdot (1/b)$ or $f(x) = a/x - b$
- Algorithm for $1/b$
  - $x_0 \approx 1/b$ initial guess
  - $e_0 = 1 - b\,x_0$
  - $x_1 = x_0 + e_0\,x_0$
  - $e_1 = e_0\,e_0$
  - $x_2 = x_1 + e_1\,x_1$
  - etc.
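The reciprocal iteration can be sketched in a few lines. This is an illustration under stated assumptions, not the compiler's generated code: `crude_reciprocal` is a hypothetical stand-in for a hardware estimate instruction (it simply rounds the true reciprocal to about 8 bits), and plain multiplies and adds are used where real code would use fused multiply-add.

```python
import math


def crude_reciprocal(b, bits=8):
    """Hypothetical stand-in for a hardware estimate: 1/b kept to ~`bits` bits."""
    mantissa, exponent = math.frexp(1.0 / b)
    scale = 1 << (bits + 1)
    return math.ldexp(round(mantissa * scale) / scale, exponent)


def newton_reciprocal(b, iterations=3):
    """Approximate 1/b with the update x <- x + e*x, e <- e*e."""
    x = crude_reciprocal(b)
    e = 1.0 - b * x              # e_0 = 1 - b*x_0
    for _ in range(iterations):
        x = x + e * x            # x_{n+1} = x_n + e_n * x_n
        e = e * e                # e_{n+1} = e_n^2
    return x


for b in (3.0, 7.0, 1234.5):
    print(b, crude_reciprocal(b), newton_reciprocal(b), 1.0 / b)
```

Because each $e_{n+1}$ comes from squaring $e_n$ instead of recomputing $1 - b\,x_{n+1}$, the update for $x$ and the update for $e$ are independent of each other within a step, which leaves more room for interleaving.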
How Many Iterations Needed?
- Power5 reciprocal estimate instructions
  - FRES (single precision), FRE (double precision)
  - $|\text{relative error}| \le 2^{-8}$
- Floating-point precision
  - single: 24 bits
  - double: 53 bits
- Newton iterations
  - error: $2^{-16}$, $2^{-32}$, $2^{-64}$, $2^{-128}$
  - single: 2 iterations for 1 ulp
  - double: 3 iterations for 1 ulp
  - +1 iteration for correct rounding (0.5 ulps)
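The iteration counts follow from doubling the number of accurate bits at each step, starting from the 8-bit estimate. A small sketch of that arithmetic (the function name `iterations_needed` is an assumption for illustration):

```python
def iterations_needed(target_bits, initial_bits=8):
    """Count Newton steps until the accurate bits, doubling per step, reach target_bits."""
    bits, steps = initial_bits, 0
    while bits < target_bits:
        bits *= 2
        steps += 1
    return steps


print(iterations_needed(24))  # single precision -> 2
print(iterations_needed(53))  # double precision -> 3
```

This counts only the steps needed to reach about 1 ulp; the extra iteration for correct rounding is not modeled.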
Taylor Series for Reciprocal
- $x_0 \approx 1/b$ initial guess
- $e = 1 - b\,x_0$
- $1/b = x_0/(b\,x_0) = x_0 \cdot \frac{1}{1-e} = x_0\,(1 + e + e^2 + e^3 + e^4 + \ldots)$
- Algorithm (6 terms)
  - $e = 1 - b\,x_0$
  - $t_1 = e \cdot e$
  - $q_1 = x_0 + x_0 \cdot e$
  - $t_2 = t_1 \cdot t_1$
  - $t_3 = q_1 \cdot e$
  - $q_2 = x_0 + t_2 \cdot t_3$
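One way to evaluate the six-term series with few dependent operations is the factorization $x_0(1 + e + \ldots + e^5) = x_0(1+e)(1 + e^2 + e^4)$. The sketch below uses that factorization; it is an assumption of this example and not necessarily the instruction schedule listed on the slide. The function name `taylor_reciprocal` is likewise illustrative.

```python
def taylor_reciprocal(b, x0):
    """Approximate 1/b as x0 * (1 + e + e^2 + e^3 + e^4 + e^5), with e = 1 - b*x0."""
    e = 1.0 - b * x0         # error of the initial guess
    t1 = e * e               # e^2
    q1 = x0 + x0 * e         # x0 * (1 + e)        (independent of t1)
    t2 = t1 + t1 * t1        # e^2 + e^4
    return q1 + q1 * t2      # x0 * (1 + e) * (1 + e^2 + e^4)


b, x0 = 3.0, 0.33            # illustrative inputs; x0 is a rough guess for 1/3
e = 1.0 - b * x0
print(taylor_reciprocal(b, x0))
print(x0 * sum(e ** k for k in range(6)))  # the same six terms, summed directly
```

The truncated series leaves a relative error of about $e^6$, so the quality of the result depends directly on how good the initial guess is.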
Speed/Accuracy tradeoff
- IBM compilers have -qstrict/-qnostrict
  - -qstrict: SW result should match HW division exactly
  - -qnostrict: SW result may be slightly less accurate for speed
Exceptions
- Even when a/b is representable...
- 1/b may underflow
  - a ~ b ~ huge, a/b ~ 1, 1/b denormalized
  - Causes loss of accuracy
- 1/b may overflow
  - a, b denormalized, a/b ~ 1, 1/b = Inf
  - Causes SW algorithm to produce NaN
- Handle with tests in algorithm
  - Use HW divide for exceptional cases
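The overflow case is easy to reproduce in double precision. The sketch below is illustrative only: both function names are hypothetical, the exact reciprocal `1.0 / b` stands in for a hardware estimate, and the range test shown is one possible check, not the test the compiler actually emits. It assumes `b` is nonzero, since Python raises an error on division by zero instead of returning Inf.

```python
import math
import sys


def unchecked_reciprocal_divide(a, b):
    """a * (1/b) with two Newton refinement steps and no range checks."""
    x = 1.0 / b                  # stand-in for the reciprocal estimate
    for _ in range(2):
        e = 1.0 - b * x
        x = x + e * x
    return a * x


def checked_divide(a, b):
    """Take the reciprocal path only when 1/b is a normal, finite number."""
    x = 1.0 / b
    if math.isinf(x) or math.isnan(x) or abs(x) < sys.float_info.min:
        return a / b             # exceptional case: fall back to hardware divide
    return unchecked_reciprocal_divide(a, b)


tiny = 5e-324                    # smallest positive denormalized double
print(unchecked_reciprocal_divide(tiny, tiny))  # 1/b is Inf, result is NaN
print(checked_divide(tiny, tiny))               # falls back to a / b

huge = 1.7e308                   # 1/huge is denormalized
print(checked_divide(huge, huge))               # falls back to a / b
```

With `tiny`, the estimate is Inf, so $e = 1 - b \cdot \infty = -\infty$ and the update computes $\infty + (-\infty)$, which is NaN, even though the true quotient is exactly 1.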
Algorithm variations
- User callable built-in functions
  - swdiv(a,b): double precision, checking
  - swdivs(a,b): single precision, checking
  - swdiv_nochk(a,b): double, non-checking
  - swdivs_nochk(a,b): single, non-checking
- Accuracy of swdiv, swdiv_nochk depends on -qstrict/-qnostrict
- _nochk versions faster but have argument restrictions
Accuracy and Performance
Table columns: Power5 speedup ratio | Power4 speedup ratio | Power5 ulps max error | Power4 ulps max error
Table rows: swdivs | swdivs_nochk | swdiv strict | swdiv nostrict | swdiv_nochk strict | swdiv_nochk nostrict
Automatic Generation of Software Division
- The swdivs and swdiv algorithms can also be automatically generated by the compiler
- Compiler can detect situations where throughput is more important than latency
Automatic Generation of Software Division
- In straight-line code, we use a heuristic that calculates how much FP can be executed in parallel
  - independent instructions are good, especially other divides
  - dependent instructions are bad (they increase latency)
Automatic Generation of Software Division
- In modulo scheduled loops, software-divide code can be pipelined, interleaving multiple iterations
- Divides are expanded if the divide does not appear in a recurrence (cyclic data-dependence)
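The difference between a divide inside a recurrence and one outside can be seen in the shape of the loop. The two functions below are hypothetical examples written only to show the data dependence; Python is used for illustration and says nothing about how any compiler would schedule them.

```python
def independent_divides(a, b):
    """Each quotient depends only on its own inputs: no recurrence."""
    return [x / y for x, y in zip(a, b)]


def recurrent_divides(x0, b):
    """Each divide needs the previous quotient: a cyclic data dependence."""
    x = x0
    for y in b:
        x = x / y
    return x


print(independent_divides([1.0, 4.0], [2.0, 8.0]))
print(recurrent_divides(8.0, [2.0, 2.0]))
```

In the first loop, the steps of several software divides can overlap because no iteration waits on another. In the second, every divide sits on the critical path, so the higher latency of a software divide would be paid on each iteration.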
Summary
- Software divide algorithms
  - user callable
  - compiler generated
- Loops of divides
  - up to 1.77x speedup
- UMT2K benchmark
  - 1.19x speedup