Floating Point in Computers. Complies with the standards IEEE 754 and ISO/IEC 559.
Timeline: Introduction (quite short); Binary review (not so long); Integer Arithmetic (1/3); Floating Point (1/3); Floating Point Arithmetic (1/3); Other issues (extra short).
Introduction: Who does computer arithmetic? Intel’s spare money. How is it done in hardware? How integer arithmetic relates to floating point. Now we go back to “computer structure”.
Binary numbers What is ?
Signed Binary Integers: sign-magnitude; 2’s complement; 1’s complement; biased.
Sign-Magnitude: high-order bit = sign. 0101 = 5, 1101 = -5. Two zeros (0000 and 1000).
2’s complement: Number + Negative = $2^n$. 0101 = 5, 1011 = -5. Easy addition (drop the carry). Formula: $-a_{n-1} 2^{n-1} + a_{n-2} 2^{n-2} + \dots + a_1 2 + a_0$.
1’s Complement: the negative is the bitwise complement. 0101 = 5, 1010 = -5. Two zeros (0000 and 1111). Number + Negative = $2^n - 1$.
Biased: Binary = Number + Bias. With Bias = 5: 5 is stored as 5 + 5 = 10 = 1010; -5 is stored as (-5) + 5 = 0 = 0000. Relative order remains.
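The four signed representations can be compared side by side with a small sketch. The function name `encode` and its `scheme` labels are illustrative choices, not from the slides, and no range checking is done.

```python
def encode(value: int, n: int, scheme: str, bias: int = 0) -> str:
    """Return the n-bit pattern of `value` under the named scheme (no range checks)."""
    mask = (1 << n) - 1
    if scheme == "sign-magnitude":
        # high-order bit is the sign, the remaining bits hold |value|
        bits = ((1 << (n - 1)) if value < 0 else 0) | abs(value)
    elif scheme == "twos":
        # for a negative value this equals 2^n - |value|
        bits = value & mask
    elif scheme == "ones":
        # for a negative value this equals (2^n - 1) - |value|
        bits = value if value >= 0 else (~(-value)) & mask
    elif scheme == "biased":
        bits = value + bias
    else:
        raise ValueError(scheme)
    return format(bits, f"0{n}b")


for scheme in ("sign-magnitude", "twos", "ones"):
    print(scheme, encode(5, 4, scheme), encode(-5, 4, scheme))
print("biased", encode(5, 4, "biased", bias=5), encode(-5, 4, "biased", bias=5))
```

With 4 bits, -5 comes out as 1101, 1011 and 1010 in the first three schemes, and as 0000 with a bias of 5.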
Integer Arithmetic
Adding (unsigned) Integers: the elementary-school method. The result has n+1 bits!
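Adding Integers - hardware: Half adder (inputs a, b; outputs s, $C_{out}$). Full adder (inputs a, b, $C_{in}$; outputs s, $C_{out}$): 2 logic levels.
Ripple carry Adder: [Figure: a chain of n full adders; stage i takes $a_i$, $b_i$ and the carry of stage i-1 and produces $s_i$; the last stage produces $C_{out}$.] Slow: 2n logic levels. Small constant (CMOS). Other ways exist.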
Adding Signed Integers: In 2’s complement: $b + (-a) = b + (2^n - a) = 2^n + (b - a)$; hence add as unsigned integers and discard the carry out. Likewise $(-b) + (-a) = (2^n - b) + (2^n - a) = 2^n + (2^n - (b + a))$.
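A minimal software sketch of the ripple-carry adder, also showing that 2’s complement addition is plain unsigned addition with the carry out discarded. The helper names `to_bits`, `from_bits`, `full_adder` and `ripple_add` are illustrative, and bit lists are stored least significant bit first.

```python
def to_bits(x: int, n: int) -> list[int]:
    """n-bit pattern of x, least significant bit first."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits: list[int]) -> int:
    return sum(bit << i for i, bit in enumerate(bits))


def full_adder(a: int, b: int, c_in: int) -> tuple[int, int]:
    s = a ^ b ^ c_in
    c_out = (a & b) | (a & c_in) | (b & c_in)
    return s, c_out


def ripple_add(a_bits: list[int], b_bits: list[int]) -> tuple[list[int], int]:
    """Chain one full adder per bit; the carry ripples from bit 0 upwards."""
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry


# Unsigned: 13 + 6 = 19 needs n+1 = 5 bits (sum bits 0011, carry out 1).
total, carry = ripple_add(to_bits(13, 4), to_bits(6, 4))
print(from_bits(total), carry)

# Signed: 5 + (-5), with -5 stored as 1011; discarding the carry leaves 0.
total, carry = ripple_add(to_bits(0b0101, 4), to_bits(0b1011, 4))
print(from_bits(total), carry)
```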
Subtracting Integers: add the negation. Negating in 2’s complement: how? (Invert every bit, then add 1.)
Integer (unsigned) Multiplication: the elementary-school method. The result has 2n bits!
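Hardware Multiplier: A and B hold the n-bit operands; P (n bits plus a carry) accumulates the product, and P and A shift together. P = 0; loop n times: (i) if $A_0 = 1$, add B to P; (ii) right-shift P & A. The 2n-bit product ends up in (P, A).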
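The multiplier loop can be traced in software with a short sketch. The name `shift_add_multiply` is illustrative; Python integers stand in for the registers, so the carry bit of P is implicit.

```python
def shift_add_multiply(a: int, b: int, n: int) -> int:
    """Multiply two unsigned n-bit integers with the add-and-shift loop."""
    p = 0
    for _ in range(n):
        if a & 1:            # (i) low bit of A set: add B to P
            p += b           # may carry into bit n of P
        # (ii) right-shift the combined (P, A) register by one bit:
        # the low bit of P moves into the high bit of A
        a = (a >> 1) | ((p & 1) << (n - 1))
        p >>= 1
    return (p << n) | a      # 2n-bit product held in (P, A)


print(shift_add_multiply(13, 11, 4))  # 143
```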
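Integer (unsigned) Division: the elementary-school method (long division). 1101 / 0011: result 0100, remainder 1. In decimal: 13/3 = 4, remainder 1.
Hardware Divider: A holds the n-bit dividend, B the divisor, and P is n+1 bits wide. P = 0; loop n times: (i) left-shift P & A; (ii) subtract B from P: if the result is non-negative, set $a_0 = 1$; if it is negative, set $a_0 = 0$ and restore P (add B). At the end A holds the quotient and P the remainder.
Example: 13 / 3 = 4 (remainder 1). n = 4, A = 1101, B = 00011, P = 00000.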
[Table: step-by-step contents of P, A and B for 13 / 3; at the end A holds the quotient and P the remainder.]
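The restoring divider can be traced in software with a sketch like the following. The name `restoring_divide` is illustrative; it returns the final contents of A (quotient) and P (remainder).

```python
def restoring_divide(a: int, b: int, n: int) -> tuple[int, int]:
    """Divide the unsigned n-bit integer a by b; return (quotient, remainder)."""
    if b == 0:
        raise ZeroDivisionError("divisor must be checked for 0 first")
    p = 0
    for _ in range(n):
        # (i) left-shift (P, A): the top bit of A moves into P
        p = (p << 1) | ((a >> (n - 1)) & 1)
        a = (a << 1) & ((1 << n) - 1)
        # (ii) trial subtraction
        p -= b
        if p >= 0:
            a |= 1           # quotient bit 1
        else:
            p += b           # quotient bit 0, restore P
    return a, p


print(restoring_divide(0b1101, 0b0011, 4))  # (4, 1)
```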
Division - remarks: A non-restoring algorithm exists. Alternative: load P only if the difference is positive. Check for 0. In total, the result is 2n bits!
Integer arithmetic - remarks: Signed multiplication and division: algorithms exist, but we will not use them. What to do with the extra bits? Faster methods exist.
Floating Point
Non Integers - Other Methods: Fixed point (example: ###.#): the binary point is shifted; integer arithmetic (with extra shifting); small range of magnitudes. Rational: a/b (a, b ∈ Z).
Floating Point: Exponent + Significand (= Mantissa). $x = s \times 2^e$. Example: s = 101, e = 011: $x = 5 \times 2^3 = 40$ = 101000 in binary.
Uniqueness: the same value can be written in more than one way. Denormal (unnormalized): … $\times 10^4$. Normalized (#.### $\times 10^{\#}$): … $\times 10^4$. What about 0?
Floating Point Standard: Why standardize? Hardware accelerators; software compatibility; building software libraries; etc. IEEE 754, ISO/IEC 559. Includes: structure, arithmetic results.
Float Types 4 Precision Types: –Single –Single extended –Double –Double extended
Single Precision: 32 bits = Sign (1) + Exponent (8) + Significand (23). Exponent (e): biased (true exponent + 127). Significand (f): fixed-point fraction 0.###… Number: $1.f \times 2^{E}$, where E = e - 127.
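Single Precision - Example: bits 1 10000001 0100…0. Sign = 1 (negative); e = 10000001 = 129, so E = 129 - 127 = 2; f = 0100…0, so 1.f = 1.01 (binary) = 1.25; $X = -1.25 \times 2^2 = -5$.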
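The three fields of a single-precision number can be inspected with the standard `struct` module. This is a small sketch; `decode_single` and `rebuild` are illustrative names, and `rebuild` handles normalized numbers only.

```python
import struct


def decode_single(x: float) -> tuple[int, int, int]:
    """Return (sign, biased exponent e, fraction f) of x stored as a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    e = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    return sign, e, f


def rebuild(sign: int, e: int, f: int) -> float:
    """Value of a normalized single: (-1)^sign * 1.f * 2^(e - 127)."""
    return (-1) ** sign * (1 + f / 2**23) * 2.0 ** (e - 127)


sign, e, f = decode_single(-5.0)
print(sign, format(e, "08b"), format(f, "023b"))
print(rebuild(sign, e, f))  # -5.0
```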
Single Precision - Range: $E_{max} = 127$ (e = 254); $E_{min} = -126$ (e = 1). Why $|E_{min}| < |E_{max}|$? So that $1/2^{E_{min}}$ does not overflow. Why biased notation? What about e = 0 and e = 255?
Floating Point Precision
Examples: We shall sometimes use base 10: f will have 3 digits, $E_{max}$ will be 98 and $E_{min}$ will be -97. Ex.: $5.34 \times 10^{70}$.
NaN: Not a Number. The result of an illegal computation, or of any computation involving a NaN. Encoding: exponent = $E_{max} + 1$ and f ≠ 0 (any sign bit, all exponent bits 1, fraction #######################). Many NaNs (different f’s).
NaNs in use: a zero finder that probes outside the function’s domain, e.g. f(x) = sqrt(x) - 1. It still works, since every computation on a NaN yields a NaN. No exception is caused!
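NaN propagation can be sketched as follows. One assumption is needed: Python's `math.sqrt` raises `ValueError` for a negative argument instead of returning NaN as IEEE arithmetic does, so the IEEE behaviour is emulated by hand. The function `f` is the one from the slide.

```python
import math


def f(x: float) -> float:
    # Emulate IEEE sqrt: NaN outside the domain (math.sqrt would raise instead).
    root = math.sqrt(x) if x >= 0 else math.nan
    return root - 1          # any arithmetic on a NaN gives a NaN


inside = f(4.0)              # 1.0
outside = f(-4.0)            # nan, and no exception was raised
print(inside, outside, outside == outside)  # a NaN is not equal even to itself
```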
Zeros: 0 uses the exponent $E_{min} - 1$ with f = 0; this pattern is NOT read as $1.0 \times 2^{E_{min} - 1}$. 0 is signed: +0 and -0 both exist! What is the difference?
Signed 0s: +0 = -0 (they compare equal), BUT multiply/divide keep the sign rules. Motivation: using inf correctly (described later); log(x): log(0) = -inf, log(negative) = NaN; what is log(x) if x is -0?
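A quick check of the signed-zero rules with Python floats, which are IEEE doubles on common platforms. `math.copysign` is used only to make the sign of a zero visible.

```python
import math

pos_zero, neg_zero = 0.0, -0.0

equal = pos_zero == neg_zero                      # the two zeros compare equal
sign_of_neg = math.copysign(1.0, neg_zero)        # -1.0: the sign bit is there
product = -3.0 * pos_zero                         # sign rule: (-) * (+) gives -0
quotient = pos_zero / -2.0                        # sign rule: (+) / (-) gives -0

print(equal, sign_of_neg, product, quotient)
```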
± inf: More logic. Encoding: exponent = $E_{max} + 1$ and f = 0 (sign bit #, all exponent bits 1, fraction 0).
Inf usage: Example (if $\tan^{-1}$ is defined properly).
More on 0s and infs: General rule for 0/inf arithmetic: take the appropriate limit, e.g. 1/(1/x) where x = 0 or inf. Why not use the largest representable number instead?
Zeros and infs - yet again: $x/(x^2+1)$ is bad! Why? ($x^2$ can overflow.) $1/(x + x^{-1})$ is better. Do we need to check for x = 0? Using the two zeros and infs saves some special-case checks.
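The difference between the two formulas shows up for large x, sketched here with Python floats (IEEE doubles). The function names `naive` and `better` are illustrative. One caveat: Python raises `ZeroDivisionError` for `1.0 / 0.0` instead of returning inf, so the x = 0 case cannot be shown without an explicit check.

```python
def naive(x: float) -> float:
    return x / (x * x + 1.0)      # x * x overflows to inf for large x


def better(x: float) -> float:
    return 1.0 / (x + 1.0 / x)    # no overflow; needs 1/0 = inf to handle x = 0


x = 1e200
print(naive(x))    # 0.0: the intermediate x * x became inf
print(better(x))   # a tiny but nonzero result
```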
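Denormalized numbers: Example: x = …, y = …, two nearby values close to the smallest normal number. x - y is smaller than the smallest normal number, so it is flushed to 0. So x - y = 0 but x ≠ y. Think of: if (x ≠ y) then z = 1/(x - y). Solution: use denormalized numbers!
Denormal Numbers: Smallest normal: $1.0 \times 2^{E_{min}}$. Below it, use denormals: $0.f \times 2^{E_{min}}$, encoded as exponent = $E_{min} - 1$ and f ≠ 0. Gradual underflow: repeated division by 10 ( /10, /10, /10, …) loses one digit at a time before reaching 0.
Denormal Numbers: Back to our example: x = …, y = …; x - y is now representable as a denormal, and this is not 0!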
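The same effect can be observed with Python floats, which are binary doubles rather than the 3-digit decimal format of the slides. The values of `x` and `y` below are chosen for illustration and are not the ones from the original example.

```python
import math
import sys

smallest_normal = sys.float_info.min      # 2**-1022 for IEEE doubles
x = 1.5 * smallest_normal
y = 1.0 * smallest_normal

diff = x - y                              # 2**-1023: below the normal range
z = 1.0 / diff                            # safe because diff is a nonzero denormal

print(x != y, diff != 0.0, diff < smallest_normal, math.isfinite(z))
```

With flush-to-zero arithmetic `diff` would be 0 and the division would fail even though `x != y`.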
Flush to 0 Vs Gradual Underflow
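Special Values - Summary
Exponent (E) | Fraction | Represents
$E_{min} - 1$ | f = 0 | ±0
$E_{min} - 1$ | f ≠ 0 | $0.f \times 2^{E_{min}}$
$E_{min} \le E \le E_{max}$ | any f | $1.f \times 2^{E}$
$E_{max} + 1$ | f = 0 | ±inf
$E_{max} + 1$ | f ≠ 0 | NaN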
Rounding: Why is rounding needed? Infinitely many numbers, finite representation. Integers only overflow; almost all floating-point operations need rounding. IEEE specifies how arithmetic results are rounded.
Numbers that need rounding: Out of range: $x > 2 \times 2^{E_{max}}$ or $x < 1 \times 2^{E_{min}}$. Between two floats: a value whose binary expansion does not terminate, … $\times 2^{-4}$.
Measuring Error: ULPs (units in the last place): 0.12 (stored as $1.2 \times 10^{-1}$) vs. 0.124: 0.4 ulps; 0.12 vs. 0.118: 0.2 ulps. Relative error: difference/original; 0.12 vs. 0.124: Err = 0.004/0.124 ≈ 0.032.
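Both error measures can be computed exactly with the standard `decimal` module. The sketch assumes the approximation is 0.12 held with a two-digit significand ($1.2 \times 10^{-1}$), which is what makes the quoted figures of 0.4 and 0.2 ulps come out; `ulp_error` and `relative_error` are illustrative names.

```python
from decimal import Decimal


def ulp_error(approx: str, exact: str, digits: int) -> Decimal:
    """Error of `approx` in units of its last place, for a `digits`-digit significand."""
    a, x = Decimal(approx), Decimal(exact)
    # one ulp = 10^(exponent of the leading digit - (digits - 1))
    ulp = Decimal(1).scaleb(a.adjusted() - (digits - 1))
    return abs(a - x) / ulp


def relative_error(approx: str, exact: str) -> Decimal:
    a, x = Decimal(approx), Decimal(exact)
    return abs(a - x) / abs(x)


print(ulp_error("0.12", "0.124", 2))      # 0.4
print(ulp_error("0.12", "0.118", 2))      # 0.2
print(relative_error("0.12", "0.124"))
```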
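Calculate Using Rounding: Benign cancellation. Calculate 10.1 - 9.93 (= 0.17) without a guard digit: $1.01 \times 10^1 - 0.99 \times 10^1 = 0.02 \times 10^1 = 2.00 \times 10^{-1}$. 30 ulps off!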
Rounding problems: Catastrophic cancellation in $b^2 - 4ac$: both $b^2$ and $4ac$ are rounded, and the subtraction exposes the error. b = 3.34, a = 1.22, c = 2.28: $b^2$ = 11.2, $4ac$ = 11.1, $b^2 - 4ac$ = 0.1; correct = 0.0292 (an error of 70.8 ulps).
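The numbers on this slide can be reproduced with the standard `decimal` module by rounding every operation to 3 significant digits. This is a sketch; the context `ctx3` and the variable names are illustrative.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx3 = Context(prec=3, rounding=ROUND_HALF_EVEN)   # 3-digit decimal arithmetic
a, b, c = Decimal("1.22"), Decimal("3.34"), Decimal("2.28")

b2 = ctx3.multiply(b, b)                                   # 11.1556 rounds to 11.2
four_ac = ctx3.multiply(ctx3.multiply(Decimal(4), a), c)   # 11.1264 rounds to 11.1
rounded = ctx3.subtract(b2, four_ac)                       # 0.1

exact = b * b - 4 * a * c                                  # 0.0292 at default precision
ulps = (rounded - exact) / Decimal("0.001")                # ulp of 1.00e-1 is 0.001

print(b2, four_ac, rounded, exact, ulps)
```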
IEEE Arithmetic: Requirement: +, -, ×, / and square root should be EXACTLY rounded; remainder should be EXACTLY rounded; integer conversion should be EXACTLY rounded. Not all operations are covered (transcendental functions, binary-to-decimal conversion). “Tie break”: round to even.
Round to Even: How will 1.005 be rounded? Round up: 1.01. Round even: 1.00. Why? Example: $x_i = x_{i-1} + y - y$ with $x_0$ = 1.00, y = 0.125. Round up: 1.00, 1.01, 1.02, … Round even: 1.00, 1.00, 1.00, …
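The drift can be reproduced with the `decimal` module. One assumption is made here: every intermediate result is rounded to two decimal places with `quantize`, which is what yields exactly the sequences quoted on the slide. The function name `iterate` is illustrative.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def iterate(rounding: str, steps: int = 3) -> list[Decimal]:
    """Apply x = (x + y) - y repeatedly, rounding after every operation."""
    step = Decimal("0.01")                 # assumption: keep two decimal places
    x, y = Decimal("1.00"), Decimal("0.125")
    seq = [x]
    for _ in range(steps):
        x = (x + y).quantize(step, rounding=rounding)
        x = (x - y).quantize(step, rounding=rounding)
        seq.append(x)
    return seq


print(iterate(ROUND_HALF_UP))     # drifts upward by 0.01 per step
print(iterate(ROUND_HALF_EVEN))   # stays at 1.00
```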
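Float Multiplication: significands: integer multiply; exponents: biased addition. “Biased addition”: the sum of two biased exponents contains the bias twice, so subtract it once. Detect overflow: use an (n+1)-bit adder. Detect underflow: harder (denormals).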
Rounding Multiplication: [Three worked products, operands lost: (1) round bit 0; (2) round bit 1, all the rest 0; (3) round bit 1, all the rest 0, shift needed.]
Round, Guard, Sticky: [Figure: result layouts "number | guard | round | sticky" and "number | round | sticky".]
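Rounding Multiplication: [Figure: multiplier registers P, A, B.] Product: $x_0\,x_1.x_2x_3x_4x_5\;g\;r\;s\,s\,s\,s$. Case 1: $x_0 = 0$: shift left; result $x_1.x_2x_3x_4x_5\,g$. Case 2: $x_0 = 1$: increment the exponent; result $x_0.x_1x_2x_3x_4x_5$. In both cases the first discarded bit is the round digit and the OR of the remaining bits is the sticky bit.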
Rounding rules: r = 0: the result is already rounded; r = 1, s = 1: add 1 to the LSB; r = 1, s = 0 (a tie): add 1 if LSB = 1. Denormals: extra shifting.
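The three rules can be written as a few lines on an integer significand. This is a sketch under the assumption that the value is non-negative and that `drop` low-order bits are being discarded; `round_nearest_even` is an illustrative name.

```python
def round_nearest_even(value: int, drop: int) -> int:
    """Discard the low `drop` bits of a non-negative significand, rounding to nearest even."""
    if drop == 0:
        return value
    kept = value >> drop
    r = (value >> (drop - 1)) & 1                        # round bit: first discarded bit
    s = int(value & ((1 << (drop - 1)) - 1) != 0)        # sticky bit: OR of the rest
    if r == 1 and (s == 1 or kept & 1 == 1):
        kept += 1                                        # may carry; the caller renormalizes
    return kept


print(round_nearest_even(0b101001, 2))  # r = 0: stays 1010
print(round_nearest_even(0b101011, 2))  # r = 1, s = 1: becomes 1011
print(round_nearest_even(0b101010, 2))  # tie, LSB = 0: stays 1010
print(round_nearest_even(0b101110, 2))  # tie, LSB = 1: becomes 1100
```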
Float addition: Compute all the digits and then round? The exact sum of two numbers with very different exponents (1.00 × … + …) is too long! Use round and sticky bits instead: shift to the same exponent; r = first discarded digit; s = OR of the remaining discarded digits.
Float addition - example: Calculate: … $\times 2^0$ + … $\times 2^{-5}$. Shift to the same exponent ($2^0$): r = 1, s = 0|0|0|1 = 1. With r = 1, s = 1, rounding is needed!
Signed Addition/Subtraction: Simplest way: convert to 2’s complement. Cancellation of the high-order bit: shift. If more bits cancel: how many guard digits are needed?
Float Division: significands: integer division; exponents: biased subtraction. Very similar to multiplication. Dividing with the integer divider: compute 2 more bits (guard, round); use the remainder as the sticky bit (why?). Sign bit: XOR.
More on floats
Rounding modes: IEEE specifies 4 modes: nearest (default); towards 0; towards +inf; towards -inf. The mode affects overflow (how?).
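Python floats do not expose the rounding mode, so this sketch uses the standard `decimal` module with 3-digit precision as a stand-in. The mapping `MODES` from the four IEEE mode names to `decimal` constants and the function `divide_rounded` are illustrative.

```python
from decimal import (Context, Decimal, ROUND_CEILING, ROUND_DOWN,
                     ROUND_FLOOR, ROUND_HALF_EVEN)

MODES = {
    "nearest": ROUND_HALF_EVEN,
    "towards 0": ROUND_DOWN,
    "towards +inf": ROUND_CEILING,
    "towards -inf": ROUND_FLOOR,
}


def divide_rounded(num: int, den: int, mode: str, digits: int = 3) -> Decimal:
    """Compute num/den rounded to `digits` significant digits in the given mode."""
    ctx = Context(prec=digits, rounding=MODES[mode])
    return ctx.divide(Decimal(num), Decimal(den))


for mode in MODES:
    print(mode, divide_rounded(2, 3, mode), divide_rounded(-2, 3, mode))
```

Only the directed modes treat 2/3 and -2/3 asymmetrically.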
Exceptions: Set a flag on: underflow: the result’s magnitude is below $1.0 \times 2^{E_{min}}$; overflow: the result’s magnitude is too large for $E_{max}$; divide by 0: 1/0; inexact: rounding was needed; invalid: operations that return NaN. Flags are sticky.
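Python's `float` raises exceptions instead of exposing IEEE status flags, so sticky flags are sketched here with the `decimal` module, whose contexts keep flags of the same kind. Traps are disabled so that operations return inf or NaN and only set a flag; the context name `ctx` is illustrative.

```python
from decimal import Context, Decimal, DivisionByZero, Inexact, InvalidOperation

ctx = Context(prec=3, traps=[])            # no traps: only flags are set

third = ctx.divide(Decimal(1), Decimal(3)) # 0.333, rounding needed: inexact
inf = ctx.divide(Decimal(1), Decimal(0))   # Infinity: divide by 0
nan = ctx.divide(Decimal(0), Decimal(0))   # NaN: invalid
two = ctx.add(Decimal(1), Decimal(1))      # exact, but earlier flags stay set

print(third, inf, nan, two)
print(ctx.flags[Inexact], ctx.flags[DivisionByZero], ctx.flags[InvalidOperation])
```

The flags stay set until `ctx.clear_flags()` is called.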
Speeding up: Different algorithms may be used, but the result should still be exact. Example: the SRT divide algorithm in the Pentium: 5 of the 2048 entries in a table were wrong; a 1 in 9,000,000 chance of hitting them; check: …
Precision: Why extended precisions? To return higher accuracy (D × D → ext. D); to use for intermediate computations: …