Published by Sonya Pascal
Robert Enenkel, Allan Martin
IBM® Toronto Lab
Speeding Up Floating-Point Division With In-lined Iterative Algorithms
© Copyright IBM Corp. 2005

Outline
- Hardware floating-point division
- The case for software division
- Software division algorithms
- Special cases/tradeoffs
- Performance results
- Automatic generation

Hardware Division
- PPC fdiv, fdivs
- Advantages
  - accurate (correctly rounded)
  - handles exceptional cases (Inf, NaN)
  - lower latency than SW
- Disadvantages
  - occupies FPU completely
  - inhibits parallelism

Alternatives to HW division
- Vector libraries
  - MASS
  - higher overhead, greater speedup
- In-lined software division
  - low overhead, medium speedup

Rationale for Software Division
- Write SW division algorithm in terms of HW arithmetic instructions
  - Newton's method or Taylor series
- Latency will be higher than HW division
- But... SW instructions can be interleaved, so throughput may be better
- Requires enough independent instructions to interleave
  - loop of divisions
  - other work

Newton's Method
- To find $x$ such that $f(x) = 0$
  - Initial guess $x_0$
  - $x_{n+1} = x_n - f(x_n)/f'(x_n)$, $n = 0, 1, 2, \ldots$
- Provided $x_0$ is close enough
  - $x_n$ converges to $x$
  - It converges quadratically: $|x_{n+1} - x| < c\,|x_n - x|^2$
  - Number of bits of accuracy doubles with each iteration
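
The quadratic convergence claim can be checked numerically. The following is a minimal sketch, not taken from the slides: the helper name `newton` and the test function $f(x) = x^2 - 2$ are illustrative assumptions chosen only because the root is known.

```python
import math

def newton(f, fprime, x0, steps):
    """Return the Newton iterates x_0, x_1, ..., x_steps as a list."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - f(x) / fprime(x))  # x_{n+1} = x_n - f(x_n)/f'(x_n)
    return xs

# Illustrative example (an assumption, not from the slides):
# the root of f(x) = x^2 - 2, starting close to sqrt(2).
iterates = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.5, 4)
errors = [abs(x - math.sqrt(2.0)) for x in iterates]

for n, err in enumerate(errors):
    print(n, err)  # each error is roughly the square of the previous one
```

Each step roughly squares the error, which is the "bits of accuracy double" behaviour the slide describes, until the error reaches the limit of double precision.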

[Figure: Newton's Method]

Newton Iteration for Division
- For $1/b$, let $f(x) = 1/x - b$
- For $a/b$, use $a \cdot (1/b)$ or $f(x) = a/x - b$
- Algorithm for $1/b$
  - $x_0 \approx 1/b$ (initial guess)
  - $e_0 = 1 - b\,x_0$
  - $x_1 = x_0 + e_0 x_0$
  - $e_1 = e_0 e_0$
  - $x_2 = x_1 + e_1 x_1$
  - etc.
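
The reciprocal iteration above uses only multiplies and adds, which is what lets it be interleaved with other work. A minimal sketch in Python doubles follows; the function name `reciprocal_newton` and the hand-picked initial guess are assumptions for illustration. A hardware version would presumably start from a reciprocal estimate instruction and use fused multiply-adds.

```python
def reciprocal_newton(b, x0, iterations):
    """Refine x0 ~ 1/b using only multiplies and adds (no division)."""
    x = x0
    e = 1.0 - b * x          # e_0 = 1 - b*x_0
    for _ in range(iterations):
        x = x + e * x        # x_{n+1} = x_n + e_n*x_n
        e = e * e            # e_{n+1} = e_n^2, the error squares each step
    return x

b = 3.0
x0 = 0.33                    # hand-picked rough guess, relative error about 1e-2
recip = reciprocal_newton(b, x0, 4)
quotient = 7.0 * recip       # a/b computed as a * (1/b)
print(recip, quotient)
```

Note that the error term is squared rather than recomputed from `x`, so rounding errors made along the way are not corrected; the result is close to, but not guaranteed to be, the correctly rounded reciprocal.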

How Many Iterations Needed?
- Power5 reciprocal estimate instructions
  - FRES (single precision), FRE (double precision)
  - $|\text{relative error}| \le 2^{-8}$
- Floating-point precision
  - single: 24 bits
  - double: 53 bits
- Newton iterations
  - error: $2^{-16}$, $2^{-32}$, $2^{-64}$, $2^{-128}$
  - single: 2 iterations for 1 ulp
  - double: 3 iterations for 1 ulp
  - +1 iteration for correct rounding (0.5 ulps)
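
The iteration counts follow from doubling the number of accurate bits until the target precision is reached. A small sketch of that arithmetic (the function name is hypothetical):

```python
def newton_iterations_needed(estimate_bits, target_bits):
    """Count Newton steps until the accurate-bit count reaches target_bits."""
    bits, steps = estimate_bits, 0
    while bits < target_bits:
        bits *= 2            # quadratic convergence doubles the accurate bits
        steps += 1
    return steps

print(newton_iterations_needed(8, 24))   # single precision -> 2
print(newton_iterations_needed(8, 53))   # double precision -> 3
```

Starting from an 8-bit estimate, the sequence 16, 32 covers the 24-bit single format and 16, 32, 64 covers the 53-bit double format, matching the counts on the slide before the extra iteration for correct rounding.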

Taylor Series for Reciprocal
- $x_0 \approx 1/b$ (initial guess)
- $e = 1 - b\,x_0$
- $1/b = x_0/(b\,x_0) = x_0 \cdot \frac{1}{1-e} = x_0 (1 + e + e^2 + e^3 + e^4 + \cdots)$
- Algorithm (6 terms)
  - $e = 1 - b\,x_0$
  - $t_1 = 0.5 + e \cdot e$
  - $q_1 = x_0 + x_0 \cdot e$
  - $t_2 = 0.75 + t_1 \cdot t_1$
  - $t_3 = q_1 \cdot e$
  - $q_2 = x_0 + t_2 \cdot t_3$
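
Expanding the six steps shows why the constants 0.5 and 0.75 appear: $t_2 = 0.75 + (0.5 + e^2)^2 = 1 + e^2 + e^4$ and $t_3 = x_0 (e + e^2)$, so $q_2 = x_0 (1 + e + e^2 + \cdots + e^6)$. The sketch below checks this numerically; the function names and the sample inputs are assumptions for illustration.

```python
def reciprocal_taylor(b, x0):
    """The six-step scheme from the slide, written with plain multiplies and adds."""
    e = 1.0 - b * x0
    t1 = 0.5 + e * e
    q1 = x0 + x0 * e
    t2 = 0.75 + t1 * t1      # equals 1 + e^2 + e^4
    t3 = q1 * e              # equals x0*(e + e^2)
    q2 = x0 + t2 * t3        # equals x0*(1 + e + ... + e^6)
    return q2

def reciprocal_series(b, x0, highest_power):
    """Direct evaluation of x0*(1 + e + ... + e^highest_power) for comparison."""
    e = 1.0 - b * x0
    return x0 * sum(e ** k for k in range(highest_power + 1))

# A deliberately poor guess (e is about 0.1) makes the truncation error visible.
print(reciprocal_taylor(3.0, 0.3), reciprocal_series(3.0, 0.3, 6))
# A better guess (e is about 0.001) reaches full double precision.
print(reciprocal_taylor(3.0, 0.333))
```

Compared with the Newton form, the steps here have shorter dependence chains (`t1` and `q1` depend only on `e`), which is presumably what makes this variant attractive for interleaving.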

Speed/Accuracy tradeoff
- IBM compilers have -qstrict/-qnostrict
  - -qstrict: SW result should match HW division exactly
  - -qnostrict: SW result may be slightly less accurate for speed

Exceptions
- Even when a/b is representable...
- 1/b may underflow
  - a ~ b ~ huge, a/b ~ 1, 1/b denormalized
  - Causes loss of accuracy
- 1/b may overflow
  - a, b denormalized, a/b ~ 1, 1/b = Inf
  - Causes SW algorithm to produce NaN
- Handle with tests in algorithm
  - Use HW divide for exceptional cases
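
Both failure modes can be reproduced in ordinary IEEE double arithmetic. In this sketch, `1.0 / b` stands in for a hardware reciprocal estimate, and `newton_step` is a hypothetical helper performing one refinement step; neither is the authors' code.

```python
import math
import sys

def newton_step(b, x):
    """One reciprocal refinement step: x + (1 - b*x)*x."""
    e = 1.0 - b * x
    return x + e * x

# Underflow case: b is huge, so 1/b falls below the smallest normal double
# and is stored as a denormalized number with fewer significant bits.
huge = 1.5e308
recip_huge = 1.0 / huge
is_denormal = 0.0 < recip_huge < sys.float_info.min

# Overflow case: b is denormalized, so 1/b overflows to Inf even though
# a/b is perfectly representable when a is of similar size.
tiny = 5e-324                              # smallest positive denormalized double
recip_tiny = 1.0 / tiny
refined = newton_step(tiny, recip_tiny)    # Inf - Inf inside the step gives NaN

print(is_denormal, recip_tiny, refined)
print(tiny / tiny)                         # ordinary division still gives 1.0
```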

Algorithm variations
- User callable built-in functions
  - swdiv(a,b): double precision, checking
  - swdivs(a,b): single precision, checking
  - swdiv_nochk(a,b): double, non-checking
  - swdivs_nochk(a,b): single, non-checking
- Accuracy of swdiv, swdiv_nochk depends on -qstrict/-qnostrict
- _nochk versions faster but have argument restrictions
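
The split between checking and non-checking variants can be sketched as follows. Everything here is an illustrative assumption rather than the IBM built-ins: the `_sketch` names are hypothetical, `reciprocal_estimate` is a crude linear stand-in for a hardware estimate, the safe range of $2^{-500}$ to $2^{500}$ is arbitrary (the slides do not state the actual argument restrictions), and ordinary `a / b` stands in for the hardware divide.

```python
import math

def reciprocal_estimate(b):
    """Hypothetical stand-in for a hardware estimate; |relative error| <= 1/17."""
    m, k = math.frexp(abs(b))                 # abs(b) = m * 2**k, 0.5 <= m < 1
    est = 48.0 / 17.0 - (32.0 / 17.0) * m     # linear fit to 1/m on [0.5, 1)
    return math.copysign(math.ldexp(est, -k), b)

def swdiv_nochk_sketch(a, b):
    """Non-checking variant: assumes a and b are finite and well inside the normal range."""
    x = reciprocal_estimate(b)
    e = 1.0 - b * x
    for _ in range(4):                        # (1/17)**16 is below 2**-53
        x = x + e * x
        e = e * e
    return a * x                              # close to a/b, not correctly rounded

def swdiv_sketch(a, b):
    """Checking variant: use ordinary division outside an assumed safe range."""
    safe = (math.isfinite(a) and math.isfinite(b)
            and 2.0 ** -500 < abs(b) < 2.0 ** 500
            and abs(a) < 2.0 ** 500)
    if not safe:
        return a / b                          # stands in for the hardware divide
    return swdiv_nochk_sketch(a, b)

print(swdiv_sketch(1.0, 3.0))
print(swdiv_sketch(5e-324, 5e-324))           # denormalized inputs take the fallback
```

The range test costs a few extra instructions on every call, which is consistent with the slide's point that the _nochk versions are faster but restricted. Note that in Python the fallback raises `ZeroDivisionError` for `b == 0`, unlike a hardware divide.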

Accuracy and Performance
Columns: Power5 speedup ratio, Power4 speedup ratio, Power5 max error (ulps), Power4 max error (ulps)
- swdivs: 1.07, 1.05, 0.5
- swdivs_nochk: 1.46, 1.28, 0.5
- swdiv strict: 1.05, 0.5
- swdiv nostrict: 1.50, 1.5
- swdiv_nochk strict: 1.51, 0.5
- swdiv_nochk nostrict: 1.77, 1.5

Automatic Generation of Software Division
- The swdivs and swdiv algorithms can also be automatically generated by the compiler
- Compiler can detect situations where throughput is more important than latency

Automatic Generation of Software Division
- In straight-line code, we use a heuristic that calculates how much FP can be executed in parallel
  - independent instructions are good, especially other divides
  - dependent instructions are bad (they increase latency)

Automatic Generation of Software Division
- In modulo-scheduled loops, software-divide code can be pipelined, interleaving multiple iterations
- A divide is expanded if it does not appear in a recurrence (cyclic data dependence)

Summary
- Software divide algorithms
  - user callable
  - compiler generated
- Loops of divides
  - up to 1.77x speedup
- UMT2K benchmark
  - 1.19x speedup