Number Representation Fixed and Floating Point

No method is capable of representing ALL real numbers using finite register lengths
Must use approximations to represent values
Concentrate on two forms:
  Fixed Point
  Floating Point
Others are:
  Rational Number Systems: use ratios of integers
  Logarithmic Number Systems: use signs and logarithms of values

Fixed Versus Floating Point
Fixed Point values:
  Any two adjacent representable values differ by exactly 1 unit in the last place (ulp)
  Equal spacing between numbers
  CAN be used to represent fractional quantities
Floating Point values:
  Use two multi-bit words: mantissa and exponent
Both forms must be capable of representing signed quantities

Floating Point Characteristics
Total number of representations = total number of bit strings
For an n-bit register we have $2^n$
Range of values is larger than fixed point
Precision of values is smaller
Distance between two consecutive values increases with magnitude
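The spacing claim can be checked numerically. The sketch below compares a hypothetical fixed-point format with 8 fractional bits (`FRAC_BITS` is an assumption made only for this illustration) against IEEE double precision, using `math.ulp` from the Python standard library (Python 3.9 or later).

```python
import math

# Hypothetical fixed-point format: 8 fractional bits, so the gap is 2**-8 everywhere.
FRAC_BITS = 8
fixed_ulp = 2.0 ** -FRAC_BITS

def float_spacing(x: float) -> float:
    """Gap between x and the next larger double (requires Python 3.9+)."""
    return math.ulp(x)

for x in (1.0, 1024.0, 2.0 ** 52):
    # The fixed-point gap never changes; the floating-point gap grows with x.
    print(f"x = {x}: fixed-point gap = {fixed_ulp}, double gap = {float_spacing(x)}")
```

The fixed-point gap is constant, while the double-precision gap grows from 2^-52 near 1.0 to 1.0 near 2^52.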

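Floating Point
Fields: s | e | m
s: Sign bit (signed magnitude)
e: Exponent (a 2's complement value, stored with the bias added)
m: Mantissa (significand or fraction); m lies in [0,1) with $m_{MAX} = 1 - \mathrm{ulp}$, and the hidden bit supplies the leading 1
float: BIAS = 127 (32 bits: 1 for s, 8 for e, 23 for m)
double: BIAS = 1023 (64 bits: 1 for s, 11 for e, 52 for m)
Sign of the exponent is the complement of its MSb in biased form
Thus, adding/subtracting the bias is just complementation of the MSb (exact for a bias of $2^{k-1}$ with a k-bit exponent; the IEEE biases $2^{k-1}-1$ are one less)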

Floating Point Example
double = bfe80000
Big Endian: MSW has Higher Address
Fields: s | e | m
s = 1; e = 1022; m = 0.5
Value = $(-1)^{1} \times 1.5 \times 2^{(1022-1023)}$
Value = $-(1.5)(0.5) = -0.75$
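The example can be reproduced in a few lines. This is a minimal sketch that assumes the low-order 32-bit word of the double is zero (the slide shows only `bfe80000`); the helper name `decode_double` is hypothetical, and it handles normalized values only.

```python
import struct

def decode_double(bits: int):
    """Split a 64-bit pattern into sign, biased exponent, and fraction,
    then rebuild the value. Normalized numbers only."""
    s = bits >> 63                              # 1 sign bit
    e = (bits >> 52) & 0x7FF                    # 11 exponent bits (biased)
    f = (bits & ((1 << 52) - 1)) / 2.0 ** 52    # 52 fraction bits as a value in [0, 1)
    value = (-1) ** s * (1.0 + f) * 2.0 ** (e - 1023)  # hidden bit adds the leading 1
    return s, e, f, value

# Assumption: the least significant word is 0, so the pattern is 0xBFE8000000000000.
bits = 0xBFE80000 << 32
s, e, f, value = decode_double(bits)
print(s, e, f, value)

# Cross-check against the machine's own interpretation of the same 8 bytes.
check = struct.unpack(">d", bits.to_bytes(8, "big"))[0]
print(check)
```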

Floating Point Normalization
Redundant representations are possible!
Hidden bit helps
Out of all possible representations, choose the one with the fewest leading zeros in the significand
This is normalization
After performing arithmetic, renormalization may need to be performed

Floating Point Special Numbers
Value v when exponent e and fraction f are special values (IEEE standard) Note: NaN = Not a Number
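As a rough illustration of how the special values are encoded in double precision, the sketch below classifies a value by inspecting its exponent and fraction fields. The function name `classify` is hypothetical; the field rules it uses are the standard IEEE 754 ones (all-ones exponent for infinity and NaN, all-zeros exponent for zero and denormals).

```python
import struct

def classify(x: float) -> str:
    """Classify a double by its raw exponent (e) and fraction (f) fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    e = (bits >> 52) & 0x7FF          # 11-bit biased exponent
    f = bits & ((1 << 52) - 1)        # 52-bit fraction
    if e == 0x7FF:                    # exponent all ones
        return "NaN" if f else "infinity"
    if e == 0:                        # exponent all zeros
        return "denormal" if f else "zero"
    return "normal"

for x in (float("inf"), float("nan"), 0.0, -0.0, 5e-324, 1.0):
    print(repr(x), "->", classify(x))
```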

IEEE/ANSI 754/854 Standard

Denormalized Numbers Allow for Gradual Degradation on Underflow
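Gradual underflow can be observed directly in double precision. The sketch below is only an illustration with names chosen for this example: it shows that values below the smallest normalized number are still representable, and that the difference of two distinct small numbers does not collapse to zero.

```python
import sys

smallest_normal = sys.float_info.min   # 2**-1022, the smallest normalized double
half = smallest_normal / 2             # 2**-1023, representable only as a denormal
smallest_denormal = 2.0 ** -1074       # the smallest positive double

# Without denormals this difference would flush to zero even though x != y.
x = 1.5 * smallest_normal
y = smallest_normal
difference = x - y

print(half, smallest_denormal, difference)
print(smallest_denormal / 2)           # below the denormal range: underflows to 0.0
```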

Denormals

Operations – Internal Precision

Floating Point Addition/Subtraction

Floating Point Multiplication/Division

Conversions and Roundings

Exceptions

Rounding Schemes Signed Magnitude Two’s Complement

Round to Nearest (Signed Magnitude)

Rounding Comments

Round to Nearest Even/Odd
Round to Nearest Odd (R*)

Jamming/von Neumann Rounding

ROM Rounding

Rounding

Rounding Examples: Round Towards +∞, Downward Directed Rounding

Floating Point Operations

Adders/Subtractors

Operand Packing/Unpacking

Other Key Parts of FP Add/Sub Unit

Pre-Shifting

Four-stage Combinational Shifter
Pre-shifts Operand by 0 to 15 Bits
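A behavioral sketch of such a shifter follows. It assumes a right shift (as used to align the smaller operand before addition) and a 24-bit operand width; both are assumptions for illustration, as is the function name. Each of the four stages either passes its input through or shifts it by a fixed power of two, controlled by one bit of the shift amount.

```python
def preshift_right(operand: int, amount: int, width: int = 24) -> int:
    """Right-shift `operand` by 0..15 bits using four conditional stages.

    Stage k shifts by 2**k bits when bit k of `amount` is set, so the
    stages 8, 4, 2, 1 together cover every shift distance from 0 to 15.
    """
    if not 0 <= amount <= 15:
        raise ValueError("amount must be in 0..15")
    value = operand & ((1 << width) - 1)   # keep only `width` bits
    for stage in (3, 2, 1, 0):             # one pass per hardware stage
        if (amount >> stage) & 1:
            value >>= 1 << stage           # shift by 8, 4, 2, or 1
    return value

print(hex(preshift_right(0xABCDEF, 5)))
```

In hardware each stage is a row of 2-to-1 multiplexers, which is why four stages suffice for sixteen shift distances.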

Leading Zeros/Ones – Counting vs. Prediction

Leading Zeros Prediction

Guard Digits
What is the smallest number of extra digits needed for rounding? For post-normalization?
Multiplication: double-length result
Add/Sub with differing exponents: can have a double-length result
FP unit provides a single-length result

Significand Ranges
Assume significand $M \in (0, 1-\mathrm{ulp}]$
Then normalized M ranges as:
Multiplication: prod $= M_1 M_2$
For postnormalization need at most one shift left to get:

Significand Ranges (cont)
Division: quot=M1M2 Need at most one shift right to get: Conclusion: 1 Extra Digit Needed for Postnormalization 1 Extra Digit Needed for Round-to-Nearest 2 Extra Digits Needed G - guard R - round
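The shift counts stated for the significand ranges can be worked through explicitly. The derivation below is a sketch under one assumption not stated in the transcript: a radix-2 fractional significand normalized so that $1/2 \le M \le 1 - \mathrm{ulp}$.

```latex
% Assumption: radix 2, normalized fractional significands
\frac{1}{2} \le M_1, M_2 \le 1 - \mathrm{ulp}

% Multiplication: the product can drop below 1/2 but never below 1/4,
% so at most one left shift restores normalization.
\frac{1}{4} \le M_1 M_2 \le (1 - \mathrm{ulp})^2 < 1

% Division: the quotient can reach or exceed 1 but stays below 2,
% so at most one right shift restores normalization.
\frac{1}{2} < \frac{1/2}{1 - \mathrm{ulp}} \le \frac{M_1}{M_2}
  \le \frac{1 - \mathrm{ulp}}{1/2} = 2 - 2\,\mathrm{ulp} < 2
```

The single left shift after multiplication brings one extra low-order digit into the result, which is why one digit beyond the final precision has to be kept for postnormalization.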

“Sticky Bit” in std754
Round-to-nearest-even requires 1 extra bit: the “sticky bit”, S
S turns out to be the logical OR of the other additional bits (those beyond G and R)
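The role of the sticky bit can be sketched in software. The function below is an illustrative model, not a description of any particular hardware: it assumes the significand has already been postnormalized, takes the first discarded bit as the round bit, and forms the sticky bit as the OR of every bit below it. The name `round_nearest_even` and its parameters are hypothetical.

```python
def round_nearest_even(sig: int, extra: int) -> int:
    """Drop the low `extra` bits of `sig`, rounding to nearest, ties to even.

    The result may be one bit wider than expected if rounding carries out
    (e.g. 0b1111 -> 0b10000); a real unit would renormalize afterwards.
    """
    if extra < 1:
        raise ValueError("extra must be at least 1")
    kept = sig >> extra                                  # bits that survive
    round_bit = (sig >> (extra - 1)) & 1                 # first discarded bit
    sticky = int(sig & ((1 << (extra - 1)) - 1) != 0)    # OR of all lower bits
    lsb = kept & 1
    # Round up when above the halfway point (round_bit and sticky),
    # or exactly halfway with an odd kept value (round_bit and lsb).
    if round_bit and (sticky or lsb):
        kept += 1
    return kept

print(bin(round_nearest_even(0b1011100, 3)))  # tie, odd: rounds up to even
print(bin(round_nearest_even(0b1010100, 3)))  # tie, even: stays put
print(bin(round_nearest_even(0b1010101, 3)))  # above halfway: rounds up
```

Without the sticky bit, the second and third cases would be indistinguishable, since both have a round bit of 1.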

Floating Point Multiplier

Floating Point Divider
