o History of Floating Point o Defining Floating Point Arithmetic o Floating Point Representation o Floating Point Format o Floating Point Precisions o Floating Point Operation o Special values o Error Analysis o Exception Handling o FPU Data Register Stack

 History  8086: first computer to implement IEEE FP  separate 8087 FPU (floating point unit)  486: merged FPU and Integer Unit onto one chip  Summary  Hardware to add, multiply, and divide  Floating point data registers  Various control & status registers  Floating Point Formats  single precision (C float): 32 bits  double precision (C double): 64 bits  extended precision (C long double): 80 bits

o Representable numbers o Scientific notation: +/- d.d…d × r^exp o sign bit +/- o radix r (usually 2 or 10, sometimes 16) o significand d.d…d (how many base-r digits d?) o exponent exp (range?) o others? o Operations: o arithmetic: +, -, ×, /, ... o how to round the result to fit in the format o comparison (<, =, >) o conversion between different formats o short to long FP numbers, FP to integer o exception handling o what to do for 0/0, 2*largest_number, etc. o binary/decimal conversion o for I/O, when the radix is not 10 o Language/library support for these operations

o Floating point describes a system for representing real numbers that supports a wide range of values. o A floating-point number is one in which the decimal point can be in any position. o Example: a memory location set aside for a floating-point number can store 0.735, 62.3, or 1200.

o Radix point – or radix character is the symbol used in numerical representations to separate the integer part of a number (to the left of the radix point) from its fractional part (to the right of the radix point). Radix point is a general term that applies to all number bases: in base 10 it is called the decimal point, and in base 2 the binary point. o Fixed point – a number in which the position of the radix point is fixed. A fixed-point memory location can only accommodate a specific number of decimal places, usually 2 (for currency) or none (for integers). For example, amounts of money in U.S. currency can always be represented as numbers with exactly two digits to the right of the point (1.00, 0.76, etc.).
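The currency case above can be sketched in a few lines: store dollar amounts as an integer number of cents, so the point is implicitly fixed two places from the right. The helper names `to_fixed` and `to_string` are mine, chosen for illustration, and the sketch ignores negative amounts.

```python
# Fixed-point sketch: represent currency as an integer count of cents,
# so the "decimal point" sits at a fixed position (2 places).
def to_fixed(dollars_str):
    """Parse a decimal string like '1.00' into an integer number of cents."""
    whole, _, frac = dollars_str.partition(".")
    frac = (frac + "00")[:2]          # pad/truncate to exactly 2 places
    return int(whole) * 100 + int(frac)

def to_string(cents):
    """Format integer cents back as a dollars string."""
    return f"{cents // 100}.{cents % 100:02d}"

a = to_fixed("1.00")
b = to_fixed("0.76")
print(to_string(a + b))   # 1.76 -- integer addition, so the sum is exact
```

Because every value is an integer, addition and subtraction are exact; only multiplication and division need a rounding rule.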

o Binary case: a floating-point number is stored as three fields – a sign bit, a biased exponent, and a significand (or mantissa) – representing the value ±S × B^E, where: S is the fraction, mantissa, or significand; E is the exponent; B is the base (B = 2 in the binary case).
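The ±S × B^E decomposition for B = 2 can be observed directly with Python's `math.frexp`, which splits a float into a significand in [0.5, 1) and an integer exponent:

```python
import math

# Decompose x into significand S and exponent E with x = S * 2**E
# (base B = 2). math.frexp returns S in [0.5, 1) for nonzero x.
x = 6.5
S, E = math.frexp(x)
print(S, E)                    # 0.8125 3, since 0.8125 * 2**3 == 6.5
assert math.ldexp(S, E) == x   # ldexp recombines the two fields exactly
```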

o The IEEE has standardized the computer representation of binary floating-point numbers in IEEE 754. This standard is followed by almost all modern machines.

IEEE 754 format
– Defines single and double precision formats (32 and 64 bits)
– Standardizes formats across many different platforms
– Radix 2
– Single
» Range approx. 10^-38 to 10^38
» 8-bit exponent with 127 bias
» 23-bit mantissa
– Double
» Range approx. 10^-308 to 10^308
» 11-bit exponent with 1023 bias
» 52-bit mantissa
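The three single precision fields (sign, 8-bit biased exponent, 23-bit mantissa) can be pulled out of a value's bit pattern with the standard `struct` module; the helper name `fields32` is mine:

```python
import struct

def fields32(x):
    """Split a value, stored as IEEE 754 binary32, into its three fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 sign bit
    exponent = (bits >> 23) & 0xFF    # 8-bit exponent, stored with bias 127
    mantissa = bits & 0x7FFFFF        # 23-bit mantissa (fraction)
    return sign, exponent, mantissa

# 1.0 = +1.0 x 2**0, so the stored exponent is 0 + 127.
print(fields32(1.0))    # (0, 127, 0)
print(fields32(-2.0))   # (1, 128, 0): -1.0 x 2**1, exponent 1 + 127
```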

IEEE 754:
16-bit: Half (binary16)
32-bit: Single (binary32), decimal32
64-bit: Double (binary64), decimal64
128-bit: Quadruple (binary128), decimal128
Other: o Minifloat o Extended precision o Arbitrary-precision

o Single precision, called "float" in the C language family, and "real" or "real*4" in Fortran. This is a binary format that occupies 32 bits (4 bytes); its significand has a precision of 24 bits (about 7 decimal digits). o Double precision, called "double" in the C language family, and "double precision" or "real*8" in Fortran. This is a binary format that occupies 64 bits (8 bytes); its significand has a precision of 53 bits (about 16 decimal digits). o The other basic formats are quadruple precision binary (128 bits), as well as decimal floating point in 64-bit and 128-bit sizes.
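The "about 7 vs. about 16 decimal digits" difference is easy to see by round-tripping a double through the 32-bit format (a sketch; the helper name `as_single` is mine):

```python
import struct

def as_single(x):
    """Round a double (Python float) to the nearest single precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

pi = 3.141592653589793        # ~16 significant decimal digits in a double
print(as_single(pi))          # 3.1415927410125732: only ~7 digits survive
print(as_single(pi) == pi)    # False: single precision lost information
```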

o Guard bits – prior to a floating-point operation, the exponent and significand of each operand are loaded into ALU registers. The registers contain additional bits, called guard bits, which are used to pad out the right end of the significand with 0s so that digits shifted off during alignment are not lost. o Rounding – the precision of the result is determined by the rounding policy. The result of any operation on the significands is generally stored in a longer register, then rounded to fit the destination format.

 Round to nearest: the result is rounded to the nearest representable number.  Round toward +∞: the result is rounded up toward plus infinity.  Round toward -∞: the result is rounded down toward negative infinity.  Round toward 0: the result is rounded toward zero (truncated).
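These four policies can be illustrated with Python's `decimal` module, which exposes rounding modes directly (an assumption of this sketch: Python does not let you switch the rounding mode of its binary floats, so decimal stands in here). The halfway value 2.5 makes the differences visible:

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN

# Round +-2.5 to an integer under the four IEEE-style policies.
one = Decimal("1")
for x in (Decimal("2.5"), Decimal("-2.5")):
    for mode in (ROUND_HALF_EVEN, ROUND_CEILING, ROUND_FLOOR, ROUND_DOWN):
        print(x, mode, x.quantize(one, rounding=mode))
# Round to nearest (ties to even) takes 2.5 -> 2; ceiling takes it to 3;
# floor takes -2.5 -> -3; round toward 0 takes -2.5 -> -2.
```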

 Floating-point numbers are typically packed into a computer datum as the sign bit, the exponent field, and the significand (mantissa), from left to right. For the IEEE 754 binary formats they are apportioned as follows:

 IEEE 754 goes beyond the simple definition of a format to lay down specific practices and procedures so that floating-point arithmetic produces uniform, predictable results independent of the hardware platform.

EXPONENT OVERFLOW o A positive exponent exceeds the maximum possible exponent value. In some systems, this may be designated as +∞ or -∞. EXPONENT UNDERFLOW o A negative exponent is less than the minimum possible exponent value (e.g., -200 is less than -127). This means that the number is too small to be represented, and it may be reported as 0. SIGNIFICAND UNDERFLOW o In the process of aligning significands, digits may flow off the right end of the significand. As we shall discuss, some form of rounding is required. SIGNIFICAND OVERFLOW o The addition of two significands of the same sign may result in a carry out of the most significant bit. This can be fixed by realignment, as we shall explain.
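The two exponent conditions are easy to provoke with ordinary doubles (a small demonstration, not part of the original slides):

```python
import math
import sys

# Exponent overflow: going past the largest finite double yields infinity.
print(sys.float_info.max * 2)        # inf
assert math.isinf(sys.float_info.max * 2)

# Exponent underflow: halving the smallest positive subnormal double gives
# a result too small to represent, which is reported as 0.
tiny = 5e-324                        # smallest positive subnormal double
print(tiny / 2)                      # 0.0
```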

PHASE 1: ZERO CHECK o Addition and subtraction are identical except for a sign change, so the process begins by changing the sign of the subtrahend if it is a subtract operation. Next, if either operand is 0, the other is reported as the result. PHASE 2: SIGNIFICAND ALIGNMENT o The next phase is to manipulate the numbers so that the two exponents are equal. PHASE 3: ADDITION o The two significands are added together, taking into account their signs. Because the signs may differ, the result may be 0. There is also the possibility of significand overflow by 1 digit; if so, the result is shifted right and the exponent is incremented. An exponent overflow could occur as a result; this would be reported and the operation halted. PHASE 4: NORMALIZATION o The final phase normalizes the result. Normalization consists of shifting significand digits left until the most significant digit (bit, or 4 bits for a base-16 exponent) is nonzero.
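The four phases can be sketched as a toy base-10 adder (my own illustrative model, with 4-digit integer significands and simple truncation; real hardware works in binary and keeps guard bits):

```python
def fp_add(s1, e1, s2, e2, digits=4):
    """Toy base-10 floating point add: (s1 x 10**e1) + (s2 x 10**e2).
    Significands are `digits`-digit integers; returns (s, e)."""
    # Phase 1: zero check.
    if s1 == 0:
        return s2, e2
    if s2 == 0:
        return s1, e1
    # Phase 2: significand alignment -- shift the number with the smaller
    # exponent right; digits may flow off the right end (significand underflow).
    while e1 < e2:
        s1, e1 = s1 // 10, e1 + 1
    while e2 < e1:
        s2, e2 = s2 // 10, e2 + 1
    # Phase 3: addition, which may overflow the significand by one digit.
    s = s1 + s2
    if abs(s) >= 10**digits:
        s, e = s // 10, e1 + 1       # realign: shift right, bump exponent
    else:
        e = e1
    # Phase 4: normalization -- shift left until the leading digit is nonzero.
    while s != 0 and abs(s) < 10**(digits - 1):
        s, e = s * 10, e - 1
    return s, e

print(fp_add(1234, 0, 5678, -2))   # (1290, 0): 1234 + 56.78 kept to 4 digits
```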

o Signed zero In the IEEE 754 standard, zero is signed, meaning that there exist both a "positive zero" (+0) and a "negative zero" (−0). o Subnormal numbers Subnormal values fill the underflow gap with values whose absolute distance from one another is the same as for adjacent values just outside the underflow gap. o Infinities The infinities of the extended real number line can be represented in IEEE floating point data types, just like ordinary floating point values such as 1 and 1.5. They are not error values in any way, though they are often (but not always, as it depends on the rounding) used as replacement values when there is an overflow. Upon a divide-by-zero exception, a positive or negative infinity is returned as an exact result. An infinity can also be introduced as a numeral (like C's "INFINITY" macro, or "∞" if the programming language allows that syntax). o NaNs IEEE 754 specifies a special value called "Not a Number" (NaN) to be returned as the result of certain "invalid" operations, such as 0/0, ∞×0, or sqrt(−1). The representation of NaNs specified by the standard has some unspecified bits that could be used to encode the type of error, but there is no standard for that encoding. In theory, signaling NaNs could be used by a runtime system to extend the floating-point numbers with other special values, without slowing down the computations with ordinary values. Such extensions do not seem to be common, though.
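Signed zeros, infinities, and NaNs all behave as described when poked from Python (a quick demonstration):

```python
import math

inf = float("inf")
nan = float("nan")

print(0.0 == -0.0)               # True: the signed zeros compare equal...
print(math.copysign(1.0, -0.0))  # -1.0: ...but the sign bit is still there
print(inf + 1, inf * 2)          # inf inf: infinities behave like limits
print(inf - inf)                 # nan: no well-defined value
print(nan == nan)                # False: a NaN is unequal even to itself
```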

o NAN : Sign bit, nonzero significand, maximum exponent o Invalid Exception o occurs when exact result not a well-defined real number o 0/0 o sqrt(-1) o infinity-infinity, infinity/infinity, 0*infinity o NAN + 3 o NAN > 3? o Return a NAN in all these cases o Two kinds of NANs o Quiet - propagates without raising an exception o good for indicating missing data o Ex: max(3,NAN) = 3 o Signaling - generate an exception when touched o good for detecting uninitialized data

o Normalized nonzero representable numbers: +- 1.d…d × 2^exp o Macheps = machine epsilon = 2^(-#significand bits) = relative error in each operation o OV = overflow threshold = largest number o UN = underflow threshold = smallest normalized number o +- Zero: sign bit +-, significand and exponent all zero (why bother with -0? later)
Format           # bits  #significand bits  macheps             #exponent bits  exponent range
Single           32      23+1               2^-24 (~10^-7)      8               ~10^(+-38)
Double           64      52+1               2^-53 (~10^-16)     11              ~10^(+-308)
Double Extended  >=80    >=64               <=2^-64 (~10^-19)   >=15            ~10^(+-4932)  (80 bits on Intel machines)
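The double precision parameters can be queried at runtime. One caution for this sketch: Python's `sys.float_info.epsilon` is the gap from 1.0 to the next double, 2^-52, while macheps in the sense above (the relative rounding error bound) is half of that, 2^-53.

```python
import sys

print(sys.float_info.epsilon)    # 2.220446049250313e-16 == 2**-52
print(sys.float_info.max)        # largest double (overflow threshold OV)
print(sys.float_info.min)        # smallest normalized double (UN)

# Adding 2**-53 to 1.0 lands exactly halfway between representable
# numbers; round-to-nearest (ties to even) sends it back to 1.0.
assert 1.0 + 2**-53 == 1.0
assert 1.0 + 2**-52 > 1.0
```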

o Denormalized numbers: +- 0.d…d × 2^min_exp o sign bit, nonzero significand, minimum exponent o Fills in the gap between UN and 0 o Underflow exception o occurs when the exact nonzero result is less than the underflow threshold UN o Ex: UN/3 o return a denorm, or zero o Why bother? o Necessary so that the following code never divides by zero: o if (a != b) then x = a/(a-b)
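Gradual underflow, and the guarantee that makes `a/(a-b)` safe, can be checked directly (a small demonstration):

```python
import sys

un = sys.float_info.min        # smallest normalized double (UN)
sub = un / 4                   # a denormalized (subnormal) value below UN
print(sub)
assert 0.0 < sub < un          # nonzero: the gap between UN and 0 is filled

# The guarantee from the slide: with gradual underflow, a != b implies
# a - b != 0, so x = a/(a-b) cannot divide by zero.
a, b = un, 2 * un
assert a != b and a - b != 0.0
x = a / (a - b)                # safe
```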

o +- Infinity: sign bit, zero significand, maximum exponent o Overflow exception o occurs when the exact finite result is too large to represent accurately o Ex: 2*OV o return +- infinity o Divide by zero exception o return +- infinity = 1/+-0 o sign of zero important! Example later… o Also return +- infinity for o 3+infinity, 2*infinity, infinity*infinity o Result is exact, not an exception!

o Basic error formula o fl(a op b) = (a op b)*(1 + d), where o op is one of +, -, *, / o |d| <= eps = machine epsilon = macheps o assuming no overflow, underflow, or divide by zero o Example: adding 4 numbers fl(x1+x2+x3+x4) = {[(x1+x2)*(1+d1) + x3]*(1+d2) + x4}*(1+d3) = x1*(1+d1)*(1+d2)*(1+d3) + x2*(1+d1)*(1+d2)*(1+d3) + x3*(1+d2)*(1+d3) + x4*(1+d3) = x1*(1+e1) + x2*(1+e2) + x3*(1+e3) + x4*(1+e4), where each |ei| <~ 3*macheps o We get the exact sum of slightly changed summands xi*(1+ei) o Backward error analysis – an algorithm is called numerically stable if it gives the exact result for slightly changed inputs o Numerical stability is an algorithm design goal
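A visible consequence of the error formula: because the rounding terms d1, d2 depend on the order of operations, floating point addition is not associative (a classic demonstration, not from the original slides):

```python
# fl(a + b) = (a + b)*(1 + d) with |d| <= macheps, so different groupings
# accumulate different rounding terms and can give different results.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left)            # 0.6000000000000001
print(right)           # 0.6
print(left == right)   # False -- yet both are within a few macheps of 0.6
```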

o What happens when the “exact value” is not a real number, or too small or too large to represent accurately? o 5 Exceptions: o Overflow - exact result > OV, too large to represent o Underflow - exact result nonzero and < UN, too small to represent o Divide-by-zero - nonzero/0 o Invalid - 0/0, sqrt(-1), … o Inexact - you made a rounding error (very common!) o Possible responses o Stop with error message (unfriendly, not default) o Keep computing (default, but how?)

o Each of the 5 exceptions has the following features o A sticky flag, which is set as soon as an exception occurs o The sticky flag can be reset and read by the user o reset overflow_flag and invalid_flag o perform a computation o test overflow_flag and invalid_flag to see if any exception occurred o An exception flag, which indicates whether a trap should occur o Not trapping is the default o Instead, continue computing, returning a NAN, infinity or denorm o On a trap, there should be a user-writable exception handler with access to the parameters of the exceptional operation o Trapping or “precise interrupts” like this are rarely implemented, for performance reasons.

o FPU register format (extended precision): sign bit s, 15-bit exponent exp, 64-bit fraction frac o Eight physical registers R0–R7 form the FPU register stack, with Top marking the current top o The stack grows down and wraps around from R0 to R7 o FPU registers are typically referenced relative to the top of the stack: st(0) is the top of stack (Top), followed by st(1), st(2), … o push: decrement Top, then load o pop: store, then increment Top o The same registers can thus be named in an absolute view (R0–R7) or a stack view (st(0)–st(7))
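The register-stack addressing can be sketched as a toy Python model (the class and method names are mine, not x87 syntax; it assumes the hardware convention that a push decrements Top, matching the "grows down" wraparound from R0 to R7):

```python
# Toy model of the x87 register stack: 8 physical registers R0-R7,
# with st(i) addressed relative to Top, wrapping around mod 8.
class FPUStack:
    def __init__(self):
        self.regs = [0.0] * 8
        self.top = 0              # absolute index of st(0)

    def push(self, value):
        """Decrement Top (the stack grows down), then load the value."""
        self.top = (self.top - 1) % 8
        self.regs[self.top] = value

    def pop(self):
        """Store st(0) out, then increment Top."""
        value = self.regs[self.top]
        self.top = (self.top + 1) % 8
        return value

    def st(self, i):
        """Stack view: st(i) is i registers past Top, mod 8."""
        return self.regs[(self.top + i) % 8]

fpu = FPUStack()
fpu.push(1.0)
fpu.push(2.0)
print(fpu.st(0), fpu.st(1))   # 2.0 1.0 -- stack view after two pushes
print(fpu.pop())              # 2.0
```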

o Large number of floating point instructions and formats  ~50 basic instruction types  load, store, add, multiply  sin, cos, tan, arctan, and log! o Sampling of instructions:
Instruction  Effect                       Description
fldz         push 0.0                     Load zero
flds S       push S                       Load single precision real
fmuls S      st(0) <- st(0)*S             Multiply
faddp        st(1) <- st(0)+st(1); pop    Add and pop

☺ END