Floating Point in computers Comply with standards: IEEE 754 ISO/IEC 559.



Timeline
– Introduction: quite short
– Binary review: not so long
– Integer Arithmetic: 1/3
– Floating Point: 1/3
– Floating Point Arithmetic: 1/3
– Other issues: extra short

Introduction
– Who does computer arithmetic?
– Intel’s spare money
– How is it done in hardware?
– How integer arithmetic relates to floating point
– Now, we go back to “computer structure”

Binary numbers What is ?

Signed Binary Integers
– Sign-magnitude
– 2’s complement
– 1’s complement
– Biased

Sign-Magnitude
– High-order bit = sign
– 0101 = 5, 1101 = -5
– 2 zeros (0000 and 1000)

2’s complement
– Number + Negative = 2^n
– 0101 = 5, 1011 = -5
– Easy addition (drop carry)
– Formula: -a_{n-1}·2^{n-1} + a_{n-2}·2^{n-2} + … + a_1·2 + a_0

1’s Complement
– Negative = bitwise complement
– 0101 = 5, 1010 = -5
– 2 zeros (0000 and 1111)
– Number + Negative = 2^n - 1

Biased
– Binary = Number + Bias
– Bias = 5: 1010 = 5 (5 + 5 = 10), 0000 = -5 ((-5) + 5 = 0)
– Relative order remains
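The four signed encodings can be compared side by side. The following is a minimal sketch, assuming 4-bit words and a bias of 5; the function names are illustrative and do not come from the slides.

```python
def sign_magnitude(x, n=4):
    """High-order bit is the sign, the remaining n-1 bits hold |x|."""
    sign = 1 if x < 0 else 0
    return (sign << (n - 1)) | abs(x)

def ones_complement(x, n=4):
    """A negative number is the bitwise complement of its magnitude."""
    mask = (1 << n) - 1
    return x if x >= 0 else (~(-x)) & mask

def twos_complement(x, n=4):
    """Number + negative = 2**n, so -x is stored as 2**n - x."""
    return x & ((1 << n) - 1)

def biased(x, bias=5):
    """Stored pattern = number + bias."""
    return x + bias

for name, encode in [("sign-magnitude", sign_magnitude),
                     ("1's complement", ones_complement),
                     ("2's complement", twos_complement),
                     ("biased (5)", biased)]:
    print(f"{name:15s} +5 -> {encode(5):04b}   -5 -> {encode(-5):04b}")
```

Only the biased form keeps the patterns in the same order as the numbers they represent.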

Integer Arithmetic

Adding (unsigned) Integers
– Elementary-school method: add column by column, carrying into the next column
– Result has n+1 bits!

Adding Integers - hardware
– Half adder: inputs a, b; outputs s, C_out
– Full adder: inputs a, b, C_in; outputs s, C_out
– 2 logical levels

Ripple carry Adder
– [Figure: chain of n full adders; stage i takes a_i, b_i and the carry from stage i-1 and produces s_i; the last stage produces C_out]
– Slow: 2n logical levels
– Small constant (CMOS)
– Other ways exist
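A ripple-carry adder is a chain of full adders in which each stage waits for the carry of the previous one. The sketch below models that chain in software; bit lists are least-significant bit first, and the helper names are assumptions made for the example.

```python
def full_adder(a, b, c_in):
    """One-bit full adder: returns (sum bit, carry out)."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (c_in & (a ^ b))
    return s, c_out

def ripple_carry_add(a_bits, b_bits):
    """Add two equal-length bit lists; the carry ripples from stage 0 upward."""
    carry = 0
    s_bits = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        s_bits.append(s)
    return s_bits, carry          # n sum bits plus the carry out: n+1 bits

def to_bits(x, n):
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))

s_bits, c_out = ripple_carry_add(to_bits(13, 4), to_bits(6, 4))
print(from_bits(s_bits), c_out)   # low 4 bits of 19, and the carry out
```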

Adding Signed Integers
– In 2’s complement: b + (-a) = b + (2^n - a) = 2^n + (b - a)
– Hence: add as integers, discard carry out
– Example: (-b) + (-a) = (2^n - b) + (2^n - a) = 2^n + (2^n - (b + a))

Subtracting Integers
– Add the negation
– Negating 2’s complement: = ?
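Signed addition and subtraction in 2’s complement reduce to unsigned addition modulo 2^n. A minimal sketch, assuming 4-bit words and the standard negation rule (invert all bits, then add 1); the names are illustrative.

```python
N = 4
MASK = (1 << N) - 1

def negate(x):
    """Two's-complement negation: invert all bits, then add 1."""
    return (~x + 1) & MASK

def add(x, y):
    """Add as unsigned integers and discard the carry out."""
    return (x + y) & MASK

def subtract(x, y):
    """Subtract by adding the negation."""
    return add(x, negate(y))

print(f"{negate(0b0101):04b}")            # pattern for -5
print(f"{subtract(0b0110, 0b0101):04b}")  # 6 - 5
print(f"{add(0b1011, 0b1101):04b}")       # (-5) + (-3)
```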

Integer (unsigned) Multiplication
– Elementary-school method: shift and add partial products
– Result is 2n bits!

Hardware Multiplier
– P = 0
– Loop (n times):
  – (i) if A_0 = 1, add B to P
  – (ii) right-shift P & A
– [Figure: n-bit registers P, A and B plus a carry bit; carry, P and A shift right together]
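The shift-and-add loop can be traced in software. This sketch assumes n-bit unsigned operands and models the carry bit by letting P temporarily hold n+1 bits; it illustrates the loop and is not a description of any particular hardware.

```python
def multiply(a, b, n=4):
    """Shift-and-add multiplier: the 2n-bit product ends up in (P, A)."""
    p = 0
    for _ in range(n):
        if a & 1:                 # (i) low bit of A set: add B to P
            p += b                #     may carry into bit n of P
        # (ii) right-shift the combined (carry, P, A) register by one place
        a = (a >> 1) | ((p & 1) << (n - 1))
        p >>= 1
    return (p << n) | a

print(multiply(13, 3))            # 39
```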

Integer (unsigned) Division
– Elementary-school method (long division): 1101 / 0011
– Result: 0100, Rem 1
– Dec: 13/3 = 4, Rem 1

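Hardware Divider
– P = 0
– Loop (n times):
  – (i) left-shift P & A
  – (ii) subtract B from P:
    – positive: a_0 = 1
    – negative: a_0 = 0, restore P (add B)
– [Figure: registers P (n+1 bits), A (n bits) and B; P and A shift left together, with 0 shifted in]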

Example: 13 / 3 = 4 (remainder 1)
– n = 4, A = 1101, B = 00011, P = 00000

[Table: contents of P, A and B after each step; at the end A holds the quotient and P the remainder]
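The restoring division loop can be replayed step by step for the 13 / 3 example. The sketch below assumes unsigned n-bit operands and prints the registers after every step; the function name is an illustrative choice.

```python
def restoring_divide(a, b, n=4):
    """Restoring division: returns (quotient in A, remainder in P)."""
    p = 0
    mask = (1 << n) - 1
    width = n + 1
    for step in range(1, n + 1):
        # (i) left-shift the combined (P, A) register
        p = (p << 1) | ((a >> (n - 1)) & 1)
        a = (a << 1) & mask
        # (ii) subtract B from P
        p -= b
        if p >= 0:
            a |= 1                # result not negative: quotient bit a_0 = 1
        else:
            p += b                # negative: restore P, quotient bit a_0 = 0
        print(f"step {step}: P={p:0{width}b} A={a:0{n}b}")
    return a, p

quotient, remainder = restoring_divide(0b1101, 0b0011)
print(quotient, remainder)        # 4 1
```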

Division - remarks
– Non-restoring algorithm
– Load P only if positive
– Check for 0
– (Total) result is 2n bits!

Integer arithmetic - remarks
– Signed multiplication and division
  – Algorithms exist
  – We will not use them
– What to do with extra bits?
– Faster methods

Floating Point

Non Integers - Other Methods
– Fixed point
  – Example: ###.#
  – Binary point shifted
  – Integer arithmetic (extra shifting)
  – Small number magnitude
– Rational
  – a/b (a, b ∈ Z)

Floating Point
– Exponent + Significand (= Mantissa)
– x = s × 2^e
– Example: s = 101, e = 011, so x = 5 × 2^3 = 40 (binary 101000)

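Uniqueness
– Denormal numbers: the leading digit may be 0, so one value has several representations
– Normalized: #.### × 10^# with a nonzero leading digit, so the representation is unique
– What about 0?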

Floating Point Standard
– Why standardize?
  – Hardware accelerators
  – Software compatibility
  – Build software libraries
  – etc.
– IEEE 754, ISO/IEC 559
– Includes: structure, arithmetic results

Float Types
– 4 precision types:
  – Single
  – Single extended
  – Double
  – Double extended

Single Precision
– 32 bits: Sign (1) | Exponent (8) | Significand (23)
– Exponent (e): biased (+127)
– Significand (f): fixed fraction 0.###…
– Number: 1.f × 2^E, where E = e - 127

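Single Precision - Example
– Bits: 1 | 10000001 | 0100…0
– Sign = 1 (negative); e = 10000001 = 129, so E = 129 - 127 = 2; 1.f = 1.01 (binary) = 1.25
– X = -1.25 × 2^2 = -5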
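The three fields of a single-precision number can be pulled apart with Python's standard `struct` module. This is a minimal sketch for normalized values only; `decode_single` is an illustrative name.

```python
import struct

def decode_single(x):
    """Return (sign, biased exponent e, fraction f) of x as a 32-bit float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    e = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    return sign, e, f

sign, e, f = decode_single(-5.0)
print(f"{sign} {e:08b} {f:023b}")

# Rebuild the value from the fields: (-1)^sign * 1.f * 2^(e - 127)
value = (-1) ** sign * (1 + f / 2**23) * 2.0 ** (e - 127)
print(value)                      # -5.0
```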

Single Precision - Range
– E_max = 127 (e = 254)
– E_min = -126 (e = 1)
– Why |E_min| < |E_max|?
  – 1/2^E_min does not overflow
– Why biased notation?
– What about e = 0 and e = 255?

Floating Point Precision

Examples
– We shall use base 10 sometimes:
  – f will have 3 digits
  – E_max will be 98
  – E_min will be -97
– Ex: 5.34 × 10^70

NaN
– Not a Number
– Result of an illegal computation
  – Any computation involving a NaN
– Encoding: e = E_max + 1 & f ≠ 0
– Bit layout: # | 11111111 | #######################
– Many NaNs (different f’s)

NaNs in use
– Zero finder called outside the function’s domain
  – f(x) = sqrt(x) - 1
– Works, since every computation on a NaN yields a NaN
– No exception caused!
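NaN propagation can be observed directly. One caveat for this sketch: Python's `math.sqrt` raises `ValueError` for negative input instead of returning NaN, so the hypothetical helper `ieee_sqrt` below stands in for the IEEE behaviour the slide assumes.

```python
import math
import struct

def ieee_sqrt(x):
    """Hypothetical helper: NaN outside the domain instead of an exception."""
    return math.sqrt(x) if x >= 0 else math.nan

def f(x):
    return ieee_sqrt(x) - 1

y = f(-4.0)                       # outside the domain: NaN flows through "- 1"
print(y, y == y)                  # a NaN compares unequal even to itself

# Single-precision encoding of a NaN: e = 255 (= E_max + 1) and f != 0
bits = struct.unpack(">I", struct.pack(">f", math.nan))[0]
e = (bits >> 23) & 0xFF
frac = bits & 0x7FFFFF
print(e, frac != 0)
```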

Zeros?
– e = E_min - 1 & f = 0: this is NOT 1.0 × 2^E_min
– ±0 is signed!
– +0 and -0 both exist!
– What is the difference?

Signed 0’s
– +0 = -0
– BUT: multiply/divide keep the sign rules
– Motivation:
  – Using inf correctly (described later)
  – log(x): log(0) = -inf, log(negative) = NaN; what is log(x) as x → (-0)?
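The two zeros compare equal but remain distinguishable. The sketch below uses `math.copysign` and `math.atan2` to expose the sign, because Python raises an exception for `1 / 0.0` and `math.log(0)` rather than returning infinities.

```python
import math

pos, neg = 0.0, -0.0

print(pos == neg)                          # the two zeros compare equal
print(math.copysign(1.0, neg))             # but the sign bit is still there
print(math.copysign(1.0, neg * 3.0))       # multiplication keeps the sign rules
print(math.copysign(1.0, neg * -3.0))

# A function with a branch cut sees the difference between +0 and -0:
print(math.atan2(pos, -1.0), math.atan2(neg, -1.0))
```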

± inf
– More logic: e = E_max + 1 & f = 0
– Bit layout: # | 11111111 | 00000000000000000000000

Inf usage
– Example (if tan^-1 is defined properly)

More on 0’s and inf’s
– General rule for 0/inf arithmetic:
  – Take the appropriate limit: 1/(1/x) where x = 0 or inf
– Why not Max # instead?

Zeros and inf’s - yet again
– x/(x^2 + 1) is bad! Why?
– 1/(x + x^-1) is better
– Do we need to check for x = 0?
– Using 2 zeros and inf’s saves some special-case checks.
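The difference between the two formulas shows up for large x, where x^2 overflows to inf. A minimal sketch in double precision; the x = 0 case is left out because Python raises `ZeroDivisionError` for `1 / 0.0` instead of returning inf.

```python
import math

def bad(x):
    return x / (x * x + 1)        # x * x overflows to inf for large x

def better(x):
    return 1 / (x + 1 / x)        # stays in range for large x

x = 1e200
print(bad(x))                     # x / inf collapses to 0.0
print(better(x))                  # about 1e-200, the expected magnitude

# For moderate x both forms agree
print(bad(2.0), better(2.0))
```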

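Denormalized numbers
– Example: x and y are close to each other, with exponents near E_min
  – x - y is smaller than the smallest normalized number, so it becomes 0
  – so: x - y = 0 but x ≠ y
  – think of: if (x ≠ y) then z = 1/(x - y)
– Solution:
  – use denormalized numbers!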

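Denormal Numbers
– Smallest normal: 1.0 × 2^E_min
– Below that, use denormals: 0.f × 2^E_min
– Encoding: e = E_min - 1 & f ≠ 0
– Bit layout: # | 00000000 | #######################
– Gradual underflow: repeated division by 10 passes through the denormals before reaching 0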

Denormal Numbers
– Back to our example:
  – x - y is now a denormal number
  – and this is not 0!
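Gradual underflow can be checked in double precision, where `sys.float_info.min` is the smallest normalized value. The operands below are assumptions chosen so that every intermediate result is exact.

```python
import sys

tiny = sys.float_info.min         # smallest normalized double (2**-1022)
x = 1.5 * tiny
y = 1.25 * tiny

d = x - y                         # 0.25 * tiny: only representable as a denormal
print(x != y, d != 0.0)           # with denormals, x != y implies x - y != 0
print(d < tiny)                   # the difference lies below the normal range
```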

Flush to 0 Vs Gradual Underflow

Special Values - Summary

Exponent             Fraction   Represents
e = E_min - 1        f = 0      ±0
e = E_min - 1        f ≠ 0      0.f × 2^E_min
E_min ≤ e ≤ E_max    any f      1.f × 2^e
e = E_max + 1        f = 0      ±inf
e = E_max + 1        f ≠ 0      NaN
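The summary table maps directly onto a small classifier for 32-bit patterns. In this sketch the biased exponent field 0 corresponds to E_min - 1 and 255 to E_max + 1; the function name and labels are illustrative.

```python
def classify(bits):
    """Classify a 32-bit single-precision pattern by exponent and fraction."""
    e = (bits >> 23) & 0xFF       # biased exponent field
    f = bits & 0x7FFFFF           # fraction field
    if e == 0:                    # E_min - 1
        return "zero" if f == 0 else "denormal"
    if e == 255:                  # E_max + 1
        return "inf" if f == 0 else "NaN"
    return "normal"

for pattern in (0x00000000, 0x80000000, 0x00000001,
                0x3F800000, 0x7F800000, 0x7FC00000):
    print(f"{pattern:08X} {classify(pattern)}")
```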

Rounding
– Why is rounding needed?
  – Infinitely many numbers → finite representation
– Integers only overflow
– Almost all operations need rounding
– IEEE specifies algorithms for arithmetic

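Numbers need rounding
– Out of range:
  – x > 2 × 2^E_max, x < 1 × 2^E_min
– Between 2 floats:
  – e.g. 0.1 (decimal) = 0.000110011… (binary) = 1.10011… × 2^-4
  – the infinite expansion falls between two adjacent floats with exponent -4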
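A decimal fraction that falls between two floats can be inspected with `float.hex` and `fractions.Fraction`. The sketch assumes the value 0.1 and uses Python's double precision rather than single precision.

```python
from fractions import Fraction

x = 0.1
print(x.hex())                            # significand and exponent actually stored
print(Fraction(x) - Fraction(1, 10))      # exact rounding error of the stored value
```

The nonzero difference shows that 0.1 itself is not a float; the literal is rounded to the nearest representable neighbour.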

Measuring Error
– ULPs (units in the last place)
  – 1.2 × 10^-1 vs 0.124: 0.4 ulps
  – 1.2 × 10^-1 vs 0.118: 0.2 ulps
– Relative error
  – Difference/Original
  – 1.2 × 10^-1 vs 0.124: Err = 0.004/0.124 = 0.032
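Both error measures are one-line computations. The sketch assumes the approximation is 1.2 × 10^-1, so that one unit in the last place is 0.01, and uses `decimal.Decimal` to keep the base-10 arithmetic exact; the function names are illustrative.

```python
from decimal import Decimal

def ulp_error(approx, exact, ulp):
    """Error measured in units of the last place of the approximation."""
    return abs(Decimal(approx) - Decimal(exact)) / Decimal(ulp)

def relative_error(approx, exact):
    """Difference divided by the original (exact) value."""
    return abs(Decimal(approx) - Decimal(exact)) / abs(Decimal(exact))

print(ulp_error("0.12", "0.124", "0.01"))
print(ulp_error("0.12", "0.118", "0.01"))
print(relative_error("0.12", "0.124"))
```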

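Calculate Using Rounding
– Benign cancellation
  – Calculate 10.1 - 9.93 (= 0.17)
  – Shift to the same exponent, discarding the last digit of the smaller operand: 1.01 × 10^1 - 0.99 × 10^1 = 0.02 × 10^1 = 2.00 × 10^-1
  – 30 ulps!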
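The effect of discarding digits during alignment can be reproduced with `decimal`. This sketch assumes the operands 10.1 and 9.93 in a 3-digit decimal format and compares alignment without and with one extra guard digit; `subtract_aligned` is an illustrative helper.

```python
from decimal import Decimal, ROUND_DOWN

def subtract_aligned(x, y, quantum):
    """Shift y to x's exponent, dropping digits below `quantum`, then subtract."""
    y_aligned = Decimal(y).quantize(Decimal(quantum), rounding=ROUND_DOWN)
    return Decimal(x) - y_aligned

no_guard = subtract_aligned("10.1", "9.93", "0.1")    # y becomes 0.99 x 10^1
one_guard = subtract_aligned("10.1", "9.93", "0.01")  # one guard digit kept

# The result 2.00 x 10^-1 has an ulp of 0.001
error_ulps = abs(no_guard - Decimal("0.17")) / Decimal("0.001")
print(no_guard, one_guard)
```

With a single guard digit the subtraction is exact in this case.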

Rounding problems
– Catastrophic cancellation
  – b^2 - 4ac
  – both b^2 and 4ac are rounded
  – the (-) exposes the error
  – b = 3.34, a = 1.22, c = 2.28
  – b^2 = 11.2, 4ac = 11.1, b^2 - 4ac = 0.10
  – correct = 0.0292 (70.8 ulps)
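The numbers on this slide can be reproduced with a 3-digit `decimal` context. A sketch under that assumption; the exact value is computed in Python's default 28-digit context, which is wide enough to hold it exactly.

```python
from decimal import Context, Decimal

ctx = Context(prec=3)             # 3 significant decimal digits
a, b, c = Decimal("1.22"), Decimal("3.34"), Decimal("2.28")

b2 = ctx.multiply(b, b)                                   # 11.1556 rounds to 11.2
ac4 = ctx.multiply(ctx.multiply(Decimal(4), a), c)        # 11.1264 rounds to 11.1
computed = ctx.subtract(b2, ac4)                          # the subtraction is exact

exact = b * b - 4 * a * c
ulps = (computed - exact) / Decimal("0.001")              # ulp of 1.00 x 10^-1
print(computed, exact, ulps)
```

The subtraction itself introduces no error; it only exposes the rounding already present in its two operands.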

IEEE Arithmetic
– Requirement:
  – + - × ÷ √ should be EXACTLY rounded
  – remainder should be EXACTLY rounded
  – integer conversion should be EXACTLY rounded
– Not all operations (transcendental functions, binary to decimal)
– “Tie break”: round to even

Round to Even
– How will 1.005 be rounded?
  – Round up: 1.01
  – Round even: 1.00
– Why? Example:
  – x_i = (x_{i-1} + y) - y, x_0 = 1.00, y = 0.125
  – Round up: 1.00, 1.01, 1.02, …
  – Round even: 1.00, 1.00, 1.00, …
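The drift caused by always rounding ties upward can be simulated with two 3-digit `decimal` contexts. In this sketch y = 0.555 is an assumption chosen so that every step produces an exact tie; the iteration is x_i = (x_{i-1} + y) - y.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def iterate(rounding, steps=5):
    """Repeat x = (x + y) - y with 3-digit arithmetic and the given tie rule."""
    ctx = Context(prec=3, rounding=rounding)
    x, y = Decimal("1.00"), Decimal("0.555")
    history = [x]
    for _ in range(steps):
        x = ctx.subtract(ctx.add(x, y), y)
        history.append(x)
    return history

print([str(v) for v in iterate(ROUND_HALF_UP)])     # drifts upward each step
print([str(v) for v in iterate(ROUND_HALF_EVEN)])   # stays at the start value
```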

Float Multiplication
– Significands: integer multiply
– Exponents: biased addition
– “Biased addition”:
  – detect overflow: use an n+1 bit adder
  – detect underflow: harder (denormals)

Rounding Multiplication
– [Figure: three example products and the value each is rounded to. Cases shown: round bit 0; round bit 1 with all remaining bits 0; round bit 1 with all remaining bits 0 where a shift is needed first]

Round, Guard, Sticky
– number | guard | round | sticky
– number | round | sticky

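Rounding Multiplication
– Product in the (P, A) register pair: x_0 x_1 . x_2 x_3 x_4 x_5 g r s s s s
– Case 1: x_0 = 0, shift: result is x_1 . x_2 x_3 x_4 x_5 g; round digit is r, sticky bit is the OR of the s bits
– Case 2: x_0 = 1, increment exponent: result is x_0 . x_1 x_2 x_3 x_4 x_5; round digit is g, sticky bit is the OR of r and the s bits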

Rounding rules
– r = 0 → already rounded, OK
– r = 1, s = 1 → add 1 to LSB
– r = 1, s = 0 → add 1 if LSB = 1 (tie: round to even)
– Denormals → extra shifting
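The multiplication steps and the rounding rules combine into a short routine. The following is a sketch only: it assumes both inputs and the result are normalized single-precision numbers (no zeros, denormals, infinities, NaNs, overflow or underflow), and the helper names are illustrative.

```python
import struct

def to_bits(x):
    return struct.unpack(">I", struct.pack(">f", x))[0]

def from_bits(b):
    return struct.unpack(">f", struct.pack(">I", b))[0]

def fmul(a, b):
    """Multiply two normalized singles, rounding to nearest even."""
    xa, xb = to_bits(a), to_bits(b)
    sign = (xa ^ xb) >> 31                    # sign bit: XOR
    ea, eb = (xa >> 23) & 0xFF, (xb >> 23) & 0xFF
    ma = (xa & 0x7FFFFF) | 0x800000           # restore the hidden leading 1
    mb = (xb & 0x7FFFFF) | 0x800000
    e = ea + eb - 127                         # biased addition
    p = ma * mb                               # integer multiply, up to 48 bits
    shift = 23
    if p >> 47:                               # product >= 2: increment exponent
        shift, e = 24, e + 1
    m = p >> shift                            # the 24 bits that are kept
    r = (p >> (shift - 1)) & 1                # round bit: first discarded bit
    s = (p & ((1 << (shift - 1)) - 1)) != 0   # sticky: OR of the remaining bits
    if r and (s or (m & 1)):                  # r=1,s=1: add 1; r=1,s=0: if LSB=1
        m += 1
        if m >> 24:                           # rounding carried out of the top
            m, e = m >> 1, e + 1
    return from_bits((sign << 31) | (e << 23) | (m & 0x7FFFFF))

a = from_bits(to_bits(0.1))                   # 0.1 rounded to single precision
b = from_bits(to_bits(3.7))
print(fmul(a, b) == from_bits(to_bits(a * b)))
```

The comparison works because the product of two 24-bit significands fits exactly in a double, so packing `a * b` as a single performs one correct rounding.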

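Float addition
– Compute all digits and round?
  – the exact sum of operands with very different exponents has far more digits than the format holds
  – too long!
– Use round and sticky bits:
  – shift to the same exponent
  – r = first discarded digit
  – s = OR of the rest of the discarded digits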

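Float addition - example
– Calculate: the sum of an operand at exponent 2^0 and an operand at exponent 2^-5
– Shift exponents: bring the smaller operand to exponent 2^0
– r = 1, s = 0|0|0|1 = 1
– r = 1, s = 1 → rounding needed!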
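Addition with a round bit and a sticky bit can be sketched on a toy binary format. The assumptions here: both operands are positive, a number is a pair (m, e) meaning m × 2^e, and m is a normalized p-bit integer; `add_rs` is an illustrative name.

```python
def add_rs(m1, e1, m2, e2, p=4):
    """Add two positive toy floats using one round bit and one sticky bit."""
    if e1 < e2:                               # make operand 1 the larger exponent
        m1, e1, m2, e2 = m2, e2, m1, e1
    d = e1 - e2
    wide2 = m2 << 1                           # one extra low bit: round position
    shifted = wide2 >> d                      # shift to the same exponent
    s = (wide2 & ((1 << d) - 1)) != 0         # sticky: OR of the discarded bits
    total = (m1 << 1) + shifted               # bit 0 of total is the round bit
    e = e1
    if total >> (p + 1):                      # carry out: renormalize
        s = s or bool(total & 1)              # old round bit joins the sticky bit
        total >>= 1
        e += 1
    r = total & 1
    m = total >> 1
    if r and (s or (m & 1)):                  # round to nearest, ties to even
        m += 1
        if m >> p:                            # rounding overflowed the significand
            m, e = m >> 1, e + 1
    return m, e

print(add_rs(0b1001, 0, 0b1000, -4))          # 9 + 0.5 is a tie: rounds to even
```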

Signed Addition/Subtraction
– Simplest way: convert to 2’s complement
– Cancellation of the high-order bit: shift
– More bits cancel: how many guard digits?

Float Division
– Significands: integer division
– Exponents: biased subtraction
– Very similar to multiplication
– Dividing using integer divide:
  – Compute 2 more bits (round, guard)
  – Use the remainder as the sticky bit (Why?)
– Sign bit: XOR

More on floats

Rounding modes
– IEEE specifies 4 modes:
  – Nearest (default)
  – towards 0
  – towards +inf
  – towards -inf
– Affects overflow (How?)
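The four directions can be tried out with Python's `decimal` module, used here as a base-10 stand-in for the binary modes. The mapping of `decimal` rounding constants to the IEEE mode names is this sketch's own assumption.

```python
from decimal import (Decimal, ROUND_CEILING, ROUND_DOWN,
                     ROUND_FLOOR, ROUND_HALF_EVEN)

MODES = {
    "nearest (even)": ROUND_HALF_EVEN,
    "towards 0": ROUND_DOWN,
    "towards +inf": ROUND_CEILING,
    "towards -inf": ROUND_FLOOR,
}

def round_2_places(text, mode):
    """Round a decimal string to two places using the given mode."""
    return Decimal(text).quantize(Decimal("0.01"), rounding=mode)

for name, mode in MODES.items():
    print(f"{name:15s} {round_2_places('1.007', mode)} "
          f"{round_2_places('-1.007', mode)}")
```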

Exceptions
– Set a flag at:
  – Underflow: (1.0 × 2^E_min) × (1.0 × 2^E_min)
  – Overflow: (1.0 × 2^E_max) × (1.0 × 2^E_max)
  – Divide by 0: 1/0
  – Inexact: rounding was needed
  – Invalid: operations that return NaN
– Flags are sticky
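Sticky flags can be observed with a `decimal` context whose traps are disabled. The sketch models the 3-digit base-10 toy format (E_max = 98, E_min = -97) rather than binary hardware flags.

```python
from decimal import (Context, Decimal, DivisionByZero, Inexact,
                     InvalidOperation, Overflow, Underflow)

# traps=[] means operations set flags instead of raising Python exceptions
ctx = Context(prec=3, Emin=-97, Emax=98, traps=[])

ctx.multiply(Decimal("1E-97"), Decimal("1E-97"))   # underflow
ctx.multiply(Decimal("1E98"), Decimal("1E98"))     # overflow
ctx.divide(Decimal(1), Decimal(0))                 # divide by 0
ctx.divide(Decimal(1), Decimal(3))                 # inexact: rounding was needed
ctx.sqrt(Decimal(-1))                              # invalid: returns NaN

ctx.add(Decimal(1), Decimal(1))                    # a clean operation afterwards
for signal in (Underflow, Overflow, DivisionByZero, Inexact, InvalidOperation):
    print(signal.__name__, ctx.flags[signal])      # flags stay set until cleared

ctx.clear_flags()
```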

Speeding up
– Different algorithms may be used
– Result should be exact
– Divide: SRT algorithm in the Pentium
  – 5/2048 entries in a table
  – 1/9,000,000 chance
  – check:

Precision
– Why extended precisions?
  – Return higher accuracy (D × D → ext. D)
  – use for computations: