Presentation on theme: "Zhongkai Chen. Gonzalez-Navarro, S. ; Tsen, C. ; Schulte, M. ; Univ. of Malaga, Malaga This paper appears in: Signals, Systems and Computers, 2007. ACSSC."— Presentation transcript:
Gonzalez-Navarro, S.; Tsen, C.; Schulte, M.; Univ. of Malaga, Malaga. This paper appears in: Signals, Systems and Computers, 2007 (ACSSC), Conference Record of the Forty-First Asilomar Conference on. Publication date: 4-7 Nov. 2007.
Because binary floating-point arithmetic neither provides correct decimal rounding nor exactly represents many decimal fractions, such as 0.01 and 10^-8, demand for Decimal Floating-Point (DFP) arithmetic is increasing in global business, e-commerce and financial applications. It is estimated that errors from binary floating-point arithmetic can accumulate to an annual error of over $5 million for large billing systems. DFP significands can be encoded as Densely Packed Decimal (DPD), Binary Coded Decimal (BCD), or Binary Integer Decimal (BID). Recently Intel published results for a BID software library.
BID encoding is commonly regarded as more appropriate for implementation in software than in hardware, yet the performance of software implementations is not good enough. The authors hold a contrary view: BID is well suited for hardware implementation, since it can share hardware with a high-speed binary arithmetic unit. The proposed multiplier can be shared to perform binary floating-point multiplication and other BID-based DFP operations.
Compared with other encodings, the challenging problem of BID is rounding. Rounding off d decimal digits can be performed by dividing the product by 10^d, followed by an optional increment of the truncated result based on the rounding mode. This method of rounding, however, has long latency.
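The division-based rounding described above can be sketched in a few lines. This is an illustration only: the function name `round_off_digits` is hypothetical, and round-half-to-even is assumed as the rounding mode (the slide does not fix one).

```python
def round_off_digits(value: int, d: int) -> int:
    """Drop the d least-significant decimal digits of a non-negative integer.

    Assumption: round-half-to-even; other rounding modes change only the
    increment condition below.
    """
    if d == 0:
        return value
    # The slow step in hardware: a division by 10^d.
    quotient, remainder = divmod(value, 10 ** d)
    half = 10 ** d // 2
    # Optional increment of the truncated result, based on the rounding mode.
    if remainder > half or (remainder == half and quotient % 2 == 1):
        quotient += 1
    return quotient
```

For instance, `round_off_digits(12399, 2)` drops two digits and increments the truncated value 123 to 124.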
1. Let A and B be the DFP operands represented by the triples (A_sign, A_c, A_exp) and (B_sign, B_c, B_exp), respectively.
2. Compute the intermediate product IP_c = A_c * B_c.
3. In parallel, compute IP_exp = A_exp + B_exp.
4. Calculate the number of digits in IP_c to determine whether the result needs rounding.
5. Multiply by 10^-d to round off d digits, and then adjust IP_exp.
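The steps above can be sketched in software as follows. This is a behavioral model under stated assumptions, not the hardware design: the name `bid_multiply` is hypothetical, the result sign is taken as the XOR of the operand signs, rounding is round-half-to-even, step 5 uses an exact division instead of a multiplication by 10^-d, and exponent overflow, underflow, and special values are ignored.

```python
PRECISION = 16  # decimal digits in a decimal64 significand

def bid_multiply(a, b, precision=PRECISION):
    """Multiply two operands given as (sign, coefficient, exponent) triples."""
    a_sign, a_c, a_exp = a
    b_sign, b_c, b_exp = b
    sign = a_sign ^ b_sign            # assumption: sign handled by XOR
    ip_c = a_c * b_c                  # step 2: intermediate product
    ip_exp = a_exp + b_exp            # step 3: exponent sum
    d = max(len(str(ip_c)) - precision, 0)   # step 4: digits to round off
    if d > 0:                         # step 5: round and adjust the exponent
        q, r = divmod(ip_c, 10 ** d)
        half = 10 ** d // 2
        if r > half or (r == half and q % 2 == 1):
            q += 1
        if q == 10 ** precision:      # rounding carried into an extra digit
            q //= 10
            d += 1
        ip_c = q
        ip_exp += d
    return (sign, ip_c, ip_exp)
```

When the product already fits in the precision, `d` is zero and the function returns the exact product unchanged, which corresponds to the fast path of the design.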
A straightforward approach to determine d is to count the number of digits in the intermediate product, IP_c, using a digit counter unit. Once the number of digits in IP_c is known, the number of digits to round off is d = max(digits(IP_c) - precision, 0). The drawback of this approach is that IP_c must be computed before determining how many digits to round off.
The proposed technique uses two binary leading-one detectors to find the bit positions, A_lop and B_lop, of the leading one in each significand. Their sum, k = A_lop + B_lop, bounds the product: 2^k <= A_c * B_c < 2^(k+2). The digit count estimated from k may be one digit less than the actual number of decimal digits.
For example, suppose that the sum k is 63, so IP_c is in the range [2^63, 2^65 - 1] and has 19 or 20 decimal digits. If the precision is 16, 3 or 4 digits will be rounded off. A lookup table (LUT) indexed by k stores the minimum number of decimal digits to round off, d'. In this case, d' = 3. To determine exactly how many digits to round off, the same LUT stores pre-calculated powers of ten. Specifically, for index k the LUT stores 10^n, the smallest power of ten greater than 2^k. The sign of a comparison between IP_c and 10^n lets the design determine the exact number of digits to round off. So position k = 63 also stores 10^19 (since 10^19 > 2^63), which is compared with IP_c. Depending on the result of the comparison, either d = 3 or d = 4 digits are rounded off. LUT columns: index k, d', 10^n.
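The lookup scheme can be modelled as below. This is a sketch: `build_lut` and `digits_to_round_off` are hypothetical names, leading-one positions are assumed to be counted from bit 0, and both coefficients are assumed to be nonzero. In hardware, k is available before the product is finished; the software model simply computes both.

```python
PRECISION = 16  # decimal digits in a decimal64 significand

def build_lut(max_k, precision=PRECISION):
    """Entry k holds (d', 10^n), with 10^n the smallest power of ten > 2^k."""
    lut = []
    for k in range(max_k + 1):
        n = len(str(2 ** k))          # 2^k has n decimal digits, so 2^k < 10^n
        lut.append((max(n - precision, 0), 10 ** n))
    return lut

def digits_to_round_off(a_c, b_c, lut, precision=PRECISION):
    # Sum of the zero-based leading-one positions of both coefficients.
    k = (a_c.bit_length() - 1) + (b_c.bit_length() - 1)
    d_min, pow10 = lut[k]
    ip_c = a_c * b_c
    if ip_c < pow10:
        return d_min                  # the estimate was exact
    # The product has one digit more than 2^k; 10^n itself has n + 1 digits.
    return max(len(str(pow10)) - precision, 0)

lut = build_lut(106)                  # two 54-bit operands give k <= 106
print(lut[63])                        # entry for the k = 63 example
```

The final `max(..., 0)` covers small products where neither digit count exceeds the precision, so that no digits are rounded off in either case.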
Another LUT stores a pre-calculated approximation of w_d = 10^-d. The truncated product is P = IP_c * w_d.
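One way to see why a multiplication by an approximate 10^-d can replace the division is the fixed-point model below. The scaling is an assumption made for illustration (a rounded-up reciprocal with `N_BITS + bit_length(10^d)` fractional bits), not the table contents used in the paper, and the names `reciprocal_entry` and `truncate_digits` are hypothetical.

```python
N_BITS = 108  # assumed upper bound on the width of IP_c

def reciprocal_entry(d):
    """Return (w_d, shift) with (x * w_d) >> shift == x // 10**d for x < 2**N_BITS."""
    shift = N_BITS + (10 ** d).bit_length()
    # Ceiling of 2^shift / 10^d: the reciprocal is rounded up, never down.
    w_d = -(-(1 << shift) // 10 ** d)
    return w_d, shift

def truncate_digits(ip_c, d):
    """Drop d decimal digits from ip_c using a multiply and a shift only."""
    w_d, shift = reciprocal_entry(d)
    return (ip_c * w_d) >> shift
```

With this choice of scaling, the shifted product equals the exact quotient for every input below 2^108, so the rounding decision can still be made from the truncated result and the discarded bits.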
Why define X and Y as 54 bits? 1. Each decimal64 significand is 54 bits wide. 2. IP_c and w_d can each be up to 108 bits, so the same multiplier can be reused when rounding is needed.
If rounding is needed: To reuse the same multiplier, IP_c and w_d are split into upper and lower halves. The multiplier inputs X and Y are fed with these halves, denoted IP_c^H = IP_c[107:54], IP_c^L = IP_c[53:0], w_d^H = w_d[107:54], and w_d^L = w_d[53:0]. The final product, P = IP_c * w_d, is obtained after four multiplies: (PS_1, PC_1) = w_d^L * IP_c^L, (PS_2, PC_2) = w_d^L * IP_c^H, (PS_3, PC_3) = w_d^H * IP_c^L, and (PS_4, PC_4) = w_d^H * IP_c^H.
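The decomposition into four 54-bit multiplies can be checked with ordinary integers. This sketch ignores the carry-save form (PS, PC) of each partial product and combines plain values; `split` and `multiply_108` are hypothetical names.

```python
HALF = 54
MASK = (1 << HALF) - 1

def split(x):
    """Return the (upper, lower) 54-bit halves of a value of up to 108 bits."""
    return x >> HALF, x & MASK

def multiply_108(ip_c, w_d):
    ip_h, ip_l = split(ip_c)
    w_h, w_l = split(w_d)
    p1 = w_l * ip_l                   # weight 2^0
    p2 = w_l * ip_h                   # weight 2^54
    p3 = w_h * ip_l                   # weight 2^54
    p4 = w_h * ip_h                   # weight 2^108
    return p1 + ((p2 + p3) << HALF) + (p4 << (2 * HALF))
```

Each of the four products uses only 54-bit inputs, which is what allows a single 54-by-54 multiplier to be reused across successive cycles.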
A BID-encoded DFP multiplication takes only two cycles if IP_c fits within the result's precision. If rounding is necessary, it takes eight cycles to produce a result that complies with the IEEE P754 Draft Standard.
The number of digits to round off is obtained in parallel with the calculation of the intermediate product and a binary multiplier with carry-save feedback is employed. This allows the design to reuse an existing binary multiplier for both significand multiplication and rounding. This multiplier can also be shared to perform BFP multiplication. The design has variable latency to take advantage of the fact that multiplication results are not often rounded. The design demonstrates that BID multiplication can be efficiently implemented in hardware with much better latency than a software implementation.