1 Extending Summation Precision for Network Reduction Operations
George Michelogiannakis, Xiaoye S. Li, David H. Bailey, John Shalf
Computer Architecture Laboratory, Lawrence Berkeley National Laboratory

2 Background
- 64-bit double-precision variables are not precise enough for many operations, such as summations with billions of operands
- This is because of the limited number of mantissa bits
- Value = Mantissa x 2^(Exp - 1023 - 52)
- 1 + 1 = 2, but 2^100 + 1 = 2^100: addends below the last mantissa bit are lost
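A short Python illustration of this effect (not part of the original slides): once an addend falls below the last mantissa bit of the running value, the addition is silently lost.

```python
# A 64-bit double stores 52 mantissa bits, so near 2**100 the spacing
# between representable values is 2**48: adding 1.0 has no effect.
small = 1.0 + 1.0        # exact: both operands and the result fit easily
big = 2.0**100 + 1.0     # the +1 falls below the last mantissa bit

print(small == 2.0)          # True
print(big == 2.0**100)       # True: the addition was lost entirely
```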

3 Background
- Precision loss has been cited as an important problem
- Insufficient precision, or different results on different machines
- Researchers have resorted to increased- or infinite-precision libraries
(Figure: Add ... to 10^8)

4 Related Work and Motivation
- Intra-node (local-processor) computation has a wealth of prior work:
  - Sorting or recursion techniques
  - Software libraries that offer increased or infinite precision
  - Fixed-point integer representations with hardware support
- We focus on distributed summations, which occur with a tree-like communication pattern, such as MPI_Reduce
(Diagram: reduction tree combining A and B into A+B, C and D into C+D, and those into A+B+C+D)
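The tree-like communication pattern above can be sketched in plain Python; `tree_reduce` and its structure are illustrative only, not MPI's actual API.

```python
# Sketch of the pairwise (tree) reduction pattern behind MPI_Reduce:
# values are combined two at a time, level by level, until one remains.
def tree_reduce(values, op):
    """Combine values pairwise, level by level, as a reduction tree would."""
    level = list(values)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(op(level[i], level[i + 1]))  # A+B, C+D, ...
        if len(level) % 2:                          # odd leftover passes up
            nxt.append(level[-1])
        level = nxt
    return level[0]

print(tree_reduce([1, 2, 3, 4], lambda a, b: a + b))  # 10
```

In a real system each `op` at a given tree level runs on a different node (or NIC), which is why per-node arithmetic cost and message size dominate the design.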

5 Challenges
- In a distributed system, sorting and recursion techniques incur too much communication
- Increased-precision libraries are still not enough
- Past work has shown the benefits of performing the computation in the NIC without invoking the local processor [1]
  - NICs have limited programmable logic
  - Complex data structures for arbitrary-precision libraries are infeasible
(Diagram: CPU, NIC, network)
[1] F. Petrini et al., "NIC-based reduction algorithms for large-scale clusters," International Journal on High Performance Computer Networks, vol. 4, no. 3/4, pp. 122–136, 2006.

6 BIG INTEGERS
Our solution to enable in-NIC computation with no precision loss

7 Big Integer Expansions
- To represent the same number space as a double-precision variable, we can use a 2101-bit-wide fixed-point integer variable
- Advantages:
  - No precision loss
  - Reproducibility
  - Simple integer arithmetic
- Similar wide integers have been applied to intra-node computations [2]
[2] U. Kulisch, "Very fast and exact accumulation of products," Computing, vol. 91, no. 4, pp. 397–405, 2011.
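A sketch of the idea in Python, whose arbitrary-precision integers stand in for the wide register. Anchoring the fixed point at 2^-1074 (the smallest subnormal double) is my assumption about the layout; every finite double is an exact multiple of 2^-1074, so scaling by 2^1074 maps it to an integer and summation becomes exact integer addition.

```python
# Fixed-point "BigInt" accumulation: each double maps losslessly to an
# integer scaled by 2**1074, so any summation order gives the same result.
SCALE = 2**1074  # smallest positive subnormal double is 2**-1074 (assumed anchor)

def to_bigint(x: float) -> int:
    num, den = x.as_integer_ratio()   # exact rational form; den is a power of two
    return num * (SCALE // den)       # den always divides SCALE for finite doubles

def from_bigint(acc: int) -> float:
    return acc / SCALE                # one rounding step at the very end

vals = [1.0, 2.0**100, -2.0**100]
acc = sum(to_bigint(v) for v in vals)     # exact, reproducible
print(from_bigint(acc))                   # 1.0
print((1.0 + 2.0**100) - 2.0**100)        # 0.0 -- plain doubles lose the 1.0
```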

8 Mapping from Double Variables
- Simply shift the mantissa according to the exponent's value
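At the bit level, that shift can be sketched as follows; the helper name and the 2^1074 scale anchor are illustrative assumptions, not taken from the slides.

```python
import struct

# Pull the sign, exponent, and mantissa out of the IEEE 754 binary64
# encoding and shift the mantissa into a fixed-point integer scaled
# by 2**1074 (so the smallest subnormal lands on the lowest bit).
def double_to_fixed(x: float) -> int:
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    sign = -1 if bits >> 63 else 1
    exp = (bits >> 52) & 0x7FF
    frac = bits & ((1 << 52) - 1)
    if exp == 0:                       # subnormal: no implicit leading 1
        mant, exp = frac, 1
    else:                              # normal: restore the implicit 1
        mant = frac | (1 << 52)
    # Value = mant * 2**(exp - 1023 - 52); aligned to 2**-1074 this is
    # mant shifted left by (exp - 1) bit positions.
    return sign * (mant << (exp - 1))

print(double_to_fixed(1.0) == 1 << 1074)  # True
```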

9 Applicability to Network Operations
- Past work has applied in-NIC computation only to double-precision variables
  - Increased- or infinite-precision libraries cannot be applied
  - Programmable logic is limited: for example, the Elan3 NIC in Quadrics QsNet provides a 100 MHz RISC processor
  - Adding dedicated, fully functional floating-point hardware is costly and risky
- BigInts make in-NIC computation without precision loss feasible
  - BigInts require only simple integer arithmetic
  - FPUs from the Tensilica library use 150,000 gates; an equivalent integer adder uses 380 gates

10 In-NIC Computations with BigInts
- Advantages:
  - The local processor is not woken up from potentially deep sleep
  - The NIC-to-processor interconnect is not stressed
  - Simple dedicated hardware or programmable-logic support
- Result: latency and energy benefits
  - Past work has reported up to 121% speedup for in-NIC reductions [3]
  - While avoiding any precision loss
[3] F. Petrini et al., "NIC-based reduction algorithms for large-scale clusters," International Journal on High Performance Computer Networks, vol. 4, no. 3/4, pp. 122–136, 2006.

11 Evaluation
- Communication latency
- Computation time
- Precision gain

12 Communication Latency
- For such small payloads, latency is dominated by fixed costs
- 35% latency increase versus doubles; 2%-14% versus double-doubles
(Measurements: 50,000 reductions, one reduction operation at a time; operations are not pipelined)

13 Computation Time
- Modern Intel FPUs require 5 cycles per floating-point addition
  - Increased-precision representations may need much more: double-doubles require 20 operations for a single addition
- BigInts match the 5 cycles with a 424-bit integer adder
- An integer adder supporting the InfiniBand 4x EDR theoretical peak rate (100 Gb/s) need only be 32 bits wide, operating at 0.6 GHz
  - This requires 380 gates; simple FPUs from the Tensilica library use 150,000 gates
- In-NIC computation avoids context switches (microseconds) and waking the processor from deep sleep (potentially seconds)

14 Arc Length of Irregular Function
- We calculate the arc length of ...
- The arc-length calculation sums many highly varying quantities
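The slide's specific function is not preserved in this transcript, so the sketch below uses a stand-in irregular function `g`; only the summation pattern, a long sum of small, highly varying square-root terms, reflects the slide's point.

```python
import math

def g(x):
    # Stand-in irregular function (illustrative only, not the slide's):
    # small, fast oscillations make the segment lengths vary widely.
    return x + 0.01 * math.sin(500.0 * x)

def arc_length(f, a, b, n):
    """Piecewise-linear arc-length approximation: sum sqrt(h^2 + dy^2)."""
    h = (b - a) / n
    return sum(math.sqrt(h * h + (f(a + (i + 1) * h) - f(a + i * h)) ** 2)
               for i in range(n))

length = arc_length(g, 0.0, math.pi, 20000)
print(length > math.pi)  # True: the oscillations make the curve longer than the span
```

Each term is tiny while the partial sum keeps growing, which is exactly the regime where double-precision accumulation drifts and a BigInt accumulator does not.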

15 Arc Length of Irregular Function
- Digit comparison after expressing the results in decimal form
- BigInt has no precision loss
(To focus on the network, we assume no precision loss in local-node computations)

16 Composite Summation
- Adding operands of ... to 10^8
- BigInt equals the analytical result
(To focus on the network, we assume no precision loss in local-node computations)

17 Geometric Series
- We calculate: ...
- The answer should never be 2
  - Doubles report 2 for k > 53
  - Long doubles, for k > 64
  - Double-doubles, for k > 106
  - BigInts, for k > 1024
- After k > 1024, the numbers are outside the double-precision number space
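The slide's formula is not preserved here; assuming it is the partial geometric sum 1 + 1/2 + ... + 2^-k (which is consistent with the 53/64/106-bit thresholds quoted), the double-precision failure can be reproduced directly:

```python
# Mathematically the partial sum is 2 - 2**-k, always strictly below 2.
# In doubles, once 2**-i falls below the last mantissa bit of the running
# sum, the result rounds to exactly 2.0 and the inequality is lost.
def geometric_partial_sum(k: int) -> float:
    s = 0.0
    for i in range(k + 1):
        s += 2.0 ** -i
    return s

print(geometric_partial_sum(40) == 2.0)   # False: exactly 2 - 2**-40
print(geometric_partial_sum(100) == 2.0)  # True: precision loss
```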

18 Conclusions
- Precision loss in large system-wide operations can be a significant concern
- Previously, reduction operations without precision loss could not be performed in NICs
- Wide fixed-point (integer) representations enable this with very simple hardware
  - Cheap and fast computation without precision loss
- BigInts complement intra-node (local-processor) techniques

19 Questions?