Floating-Point Arithmetic Chapter 9

Sepehr Naimi

Floating Point Calculation
Three ways to compute with non-integer values: rational number approximation, fixed-point, and floating point.

Rational Number Approximation
Approximate the real constant by a fraction p/q and use integer arithmetic.
Example 1: b = a * 0.75;  becomes  b = a * 3 / 4;
Example 2: e ≈ 2.71828 and 193/71 ≈ 2.71831, so  b = a * e;  becomes  b = a * 193 / 71;
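The p/q idea can be sketched in C as follows. The helper name scale_by_ratio is hypothetical (it does not appear in the slides), and the 64-bit intermediate is an assumption added here so that a * p cannot overflow for 32-bit inputs.

```c
#include <stdint.h>

/* Hypothetical helper: multiply a by the fraction p/q with integers only.
   The product is formed in 64 bits so a * p cannot overflow for 32-bit
   inputs. The quotient is truncated toward zero. Assumes q != 0. */
int32_t scale_by_ratio(int32_t a, int32_t p, int32_t q)
{
    int64_t product = (int64_t)a * p;   /* multiply first to keep precision */
    return (int32_t)(product / q);      /* divide last */
}
```

For instance, scale_by_ratio(1000, 193, 71) approximates 1000 * e and yields 2718.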

Example: Calculate the area of a circle whose radius is in R0, approximating pi by 22/7.
Solution in C:
    int area = R0 * R0 * 22 / 7;   // area = pi * R0^2
Solution in Assembly:
    mul  r0, r0, r0   @ r0 = r0 * r0 (R0 squared)
    mov  r1, #22
    mul  r0, r0, r1   @ r0 = 22 * r0
    mov  r1, #7
    udiv r0, r0, r1   @ r0 = r0 / 7
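The order of operations matters in this example: the multiplication by 22 has to happen before the division by 7. A minimal sketch of the difference, using two hypothetical function names chosen for illustration:

```c
#include <stdint.h>

/* Multiply by 22 first, then divide by 7, as in the example above. */
uint32_t circle_area_mul_first(uint32_t r)
{
    return r * r * 22 / 7;
}

/* Pitfall: 22 / 7 is evaluated as an integer division and becomes 3,
   so this computes 3 * r * r instead of an approximation of pi * r * r. */
uint32_t circle_area_div_first(uint32_t r)
{
    return r * r * (22 / 7);
}
```

With r = 10 the first version gives 314, while the second gives 300.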

Fixed-Point
Scale numbers by a power of 10 (or a power of 2) so that they can be stored as integers.
Example 1: A length of 1.4 cm is 14 mm. If we work in mm, we can use integers.
Example 2: At petrol stations, gasoline is sold with a precision of 0.01 liter, so the numbers become integers if we scale by 100: 25.12 liters → 2512.

Example: R0 contains the amount of gasoline used, in hundredths of a liter. Each liter costs $12. Calculate the price with a precision of one cent.
Solution:
Price in dollars = liters * 12 = hundredths of a liter * 12 / 100
Price in cents = price in dollars * 100 = hundredths of a liter * 12
    int cent = R0 * 12;
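The same calculation as a small C sketch. The function name and the price_per_liter_dollars parameter are hypothetical, and the code assumes the product fits in 32 bits.

```c
#include <stdint.h>

/* Price in cents for a volume given in hundredths of a liter.
   The scale of 100 on the volume cancels the dollars-to-cents factor
   of 100, so no division is needed. */
uint32_t price_in_cents(uint32_t hundredths_of_liter,
                        uint32_t price_per_liter_dollars)
{
    return hundredths_of_liter * price_per_liter_dollars;
}
```

For 25.12 liters (2512 hundredths) at $12 per liter this gives 30144 cents, that is, $301.44.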

Calculation for Fixed-Point
To add or subtract two fixed-point numbers with the same scaling factor, we simply add or subtract the scaled integers:
100×m + 100×n = 100×(m+n)
For example: 5.40 liters + 2.31 liters = 7.71 liters, computed as 540 + 231 = 771 hundredths of a liter.

Calculation for Fixed-Point (Cont.)
When multiplying two fixed-point numbers with the same scaling factor, the result must be divided by the scaling factor:
(100×m) × (100×n) / 100 = 100 × 100 × (m×n) / 100 = 100×(m×n)
When dividing, the dividend must first be multiplied by the scaling factor:
100 × (100×m) / (100×n) = 100×(m / n)
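The four fixed-point rules can be sketched in C with a scaling factor of 100. The type name fixed_t and the fx_ function names are hypothetical, and the 64-bit intermediates are an assumption added to avoid overflow in the products.

```c
#include <stdint.h>

#define SCALE 100            /* values are stored as value * 100 */

typedef int32_t fixed_t;     /* hypothetical fixed-point type */

/* Addition and subtraction work directly on the scaled integers. */
fixed_t fx_add(fixed_t a, fixed_t b) { return a + b; }
fixed_t fx_sub(fixed_t a, fixed_t b) { return a - b; }

/* The raw product carries SCALE twice, so divide once by SCALE. */
fixed_t fx_mul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * b) / SCALE);
}

/* The raw quotient loses SCALE, so pre-multiply the dividend.
   Assumes b != 0. */
fixed_t fx_div(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * SCALE) / b);
}
```

For example, fx_mul(150, 200) represents 1.50 × 2.00 and returns 300, which stands for 3.00.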

Floating Point IEEE 754 single precision

Converting numbers to single precision
If the number is positive, bit 31 is 0. If the number is negative, bit 31 is 1.
The real number is converted to its binary form.
The binary number is normalized to 1.xxxx E yyyy.
The bias 127 (0x7F) is added to the exponent portion, yyyy, to get the biased exponent, which is placed in bits 30 to 23.
The significand, xxxx, is placed in bits 22 to 0.

Example: Convert 9.75 (decimal) to IEEE 754 single-precision floating-point format
Solution:
Sign bit 31 is 0 for positive.
Decimal 9.75 = binary 1001.11, which is normalized to 1.00111 E 3.
Exponent bits 30 to 23 are 1000 0010 after adding the bias (3 + 0x7F = 0x82).
Significand bits 22 to 0 are 001 1100 0000 0000 0000 0000.
Putting them all together gives the following:
0100 0001 0001 1100 0000 0000 0000 0000 = 0x411C0000
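The result can be cross-checked on a machine by reading the bits of a float back as an integer. This is a minimal sketch that assumes the C float type is IEEE 754 single precision (true on Arm); the struct and function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three single-precision fields. */
typedef struct {
    uint32_t sign;             /* bit 31 */
    uint32_t biased_exponent;  /* bits 30 to 23 */
    uint32_t fraction;         /* bits 22 to 0 */
} f32_fields;

/* Copy the raw bit pattern of a float into a 32-bit integer. */
uint32_t f32_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Split the bit pattern into sign, biased exponent, and fraction. */
f32_fields f32_split(float x)
{
    uint32_t bits = f32_bits(x);
    f32_fields f;
    f.sign = bits >> 31;
    f.biased_exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}
```

For 9.75f the fields are sign 0, biased exponent 0x82, and fraction 0x1C0000, which together form 0x411C0000.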

IEEE 754 Double-Precision Floating Point

Example: Convert ... to IEEE 754 double-precision floating-point format.
Solution:
Sign bit 63 is 0 for positive.
Decimal ... = binary ..., which is normalized to 1.0011... E 7.
Exponent bits 62 to 52 are 100 0000 0110 after adding the bias (7 + 0x3FF = 0x406).
Significand bits 51 to 0 are 0011 ...
Putting them all together gives the following:
0100 0000 0110 0011 ...
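The double-precision layout (1 sign bit, 11 exponent bits, 52 significand bits, bias 0x3FF) can be checked the same way. The sketch below assumes the C double type is IEEE 754 double precision and uses hypothetical function names. The usage note uses 9.75 as an input chosen for illustration; it is not the value from the example above.

```c
#include <stdint.h>
#include <string.h>

/* Copy the raw bit pattern of a double into a 64-bit integer. */
uint64_t f64_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Bit 63: sign. */
uint32_t f64_sign(double x)
{
    return (uint32_t)(f64_bits(x) >> 63);
}

/* Bits 62 to 52: biased exponent (11 bits). */
uint32_t f64_biased_exponent(double x)
{
    return (uint32_t)((f64_bits(x) >> 52) & 0x7FFu);
}

/* Bits 51 to 0: significand (52 bits). */
uint64_t f64_fraction(double x)
{
    return f64_bits(x) & 0xFFFFFFFFFFFFFULL;
}
```

For 9.75 the biased exponent is 3 + 0x3FF = 0x402 and the full pattern is 0x4023800000000000.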

Half-precision Floating-Point

Arm Arithmetic Co-processors
VFP (Vector Floating-Point): performs single-precision and double-precision arithmetic operations that are fully compliant with the IEEE 754 standard.
NEON: SIMD (Single Instruction Multiple Data) unit. Supports integers, fixed-point numbers, and single-precision floating point. Used for media applications and digital signal processing.

VFP and NEON in Raspberry Pi
Board               VFP     NEON
Raspberry Pi 1      VFPv2   No
Raspberry Pi 2      VFPv3   Yes
Raspberry Pi 3      VFPv4   Yes
Raspberry Pi Zero   VFPv2   No

Registers in VFPv2

Floating-point status and control register (FPSCR)
Bits       Name                           Function
31-28      N, Z, C, V                     Negative, Zero, Carry, Overflow flags
25         DN                             Default NaN mode control
24         FZ                             Flush-to-zero mode control
23-22      RMode                          Rounding mode control
21-20      Stride                         Step size in vector
18-16      Len                            Length of the vector
15, 12-8   IDE, IXE, UFE, OFE, DZE, IOE   Exception trap enable bits
7, 4-0     IDC, IXC, UFC, OFC, DZC, IOC   Cumulative exception bits
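The bit positions in the FPSCR table translate into shifts and masks. The following C sketch decodes a 32-bit FPSCR value that has already been copied into an integer; the struct and function names are hypothetical, and the exception bits are left out for brevity.

```c
#include <stdint.h>

/* Hypothetical view of the main FPSCR fields. */
typedef struct {
    unsigned n, z, c, v;   /* bits 31 to 28: condition flags */
    unsigned dn;           /* bit 25: default NaN mode */
    unsigned fz;           /* bit 24: flush-to-zero mode */
    unsigned rmode;        /* bits 23 to 22: rounding mode */
    unsigned stride;       /* bits 21 to 20: vector step size */
    unsigned len;          /* bits 18 to 16: vector length */
} fpscr_fields;

/* Extract each field with a shift and a mask. */
fpscr_fields fpscr_decode(uint32_t fpscr)
{
    fpscr_fields f;
    f.n      = (fpscr >> 31) & 1u;
    f.z      = (fpscr >> 30) & 1u;
    f.c      = (fpscr >> 29) & 1u;
    f.v      = (fpscr >> 28) & 1u;
    f.dn     = (fpscr >> 25) & 1u;
    f.fz     = (fpscr >> 24) & 1u;
    f.rmode  = (fpscr >> 22) & 3u;
    f.stride = (fpscr >> 20) & 3u;
    f.len    = (fpscr >> 16) & 7u;
    return f;
}
```

For example, decoding 0x60000000 reports Z = 1 and C = 1 with all other fields zero.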

Floating-Point Data Processing Instructions
Mnemonic   Function                               Description
VABS       Absolute                               Obtain the absolute value of the operand
VNEG       Negate                                 Negate the value of the operand
VSQRT      Square root                            Obtain the square root of the operand
VADD       Add                                    Add the operands
VSUB       Subtract                               Subtract the second operand from the first operand
VDIV       Divide                                 Divide the first operand by the second operand
VMUL       Multiply                               Multiply the two operands
VNMUL      Multiply negate                        Multiply the two operands, then negate the result
VMLA       Multiply and accumulate                Multiply the two operands, then add the result to the destination register and store the final result in the destination register
VNMLA      Multiply and accumulate negate         Multiply the two operands, then add the result to the destination register, negate the final result, and store it in the destination register
VMLS       Multiply and subtract                  Multiply the two operands, then subtract the result from the destination register and store the final result in the destination register
VNMLS      Multiply and subtract negate           Multiply the two operands, then subtract the result from the destination register, negate the final result, and store it in the destination register
VFMA       Fused multiply and accumulate          Same as VMLA except using a fused operation (single rounding at the final result)
VFMS       Fused multiply and subtract            Same as VMLS except using a fused operation
VFNMA      Fused multiply and accumulate negate   Same as VNMLA except using a fused operation
VFNMS      Fused multiply and subtract negate     Same as VNMLS except using a fused operation
VCMP       Compare                                Subtract the second operand from the first operand and set the NZCV bits of FPSCR
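The multiply-accumulate variants differ only in signs. The C sketch below models the arithmetic exactly as worded in the table descriptions, with d standing for the destination register; the model_ function names are hypothetical, and rounding details (fused versus non-fused) are not modeled.

```c
/* Each function returns the new value of the destination register d,
   following the wording of the instruction table. */

/* VNMUL: multiply, then negate. */
float model_vnmul(float a, float b)          { return -(a * b); }

/* VMLA: add the product to the destination. */
float model_vmla(float d, float a, float b)  { return d + a * b; }

/* VNMLA: add the product to the destination, then negate. */
float model_vnmla(float d, float a, float b) { return -(d + a * b); }

/* VMLS: subtract the product from the destination. */
float model_vmls(float d, float a, float b)  { return d - a * b; }

/* VNMLS: subtract the product from the destination, then negate. */
float model_vnmls(float d, float a, float b) { return -(d - a * b); }
```

With d = 10, a = 2, and b = 3 the results are -6, 16, -16, 4, and -4 respectively.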

Format modifiers of Floating-Point Instructions
Modifier   Type
.f32       Single precision
.f64       Double precision
Examples:
    vabs.f32 s1, s0   @ s1 = abs(s0)
    vneg.f64 d1, d0   @ d1 = -d0

VMOV
Between a CPU register and a VFP register:
    vmov.f32 s1, r1       @ copy content of R1 to S1
    vmov.f32 r2, s2       @ copy content of S2 to R2
Between VFP registers:
    vmov.f32 s2, s1       @ copy content of S1 to S2
Immediate values (in VFPv3 and later):
    vmov.f32 s1, #2.0     @ load S1 with 2.0
    vmov.f32 s2, #0.125   @ load S2 with 0.125

VLDR and VSTR
Examples:
    vldr.f32 s2, [r2, #4]    @ load S2 from address R2 + 4 (R2 holds the base addr.)
    vstr.f32 s2, [r3, #-4]   @ store S2 to address R3 - 4 (R3 holds the base addr.)

Example: Write a program to calculate the area of a circle in single-precision floating-point format. The radius of the circle is in register S0, and the area should be left in S0.
Solution:
    vmul.f32 s0, s0, s0    @ calculate r^2
    ldr      r2, =piNumber
    vldr.f32 s1, [r2]      @ load pi
    vmul.f32 s0, s0, s1    @ multiply by pi
    ...
piNumber: .float 3.14159265

Example: Write a program to add two floating-point numbers in memory and save the result in memory.
Solution:
    .text
    .global _start
_start:
    ldr      r3, =operand1   @ load address of operand1
    vldr.f32 s0, [r3]        @ load operand1 in S0
    ldr      r3, =operand2   @ load address of operand2
    vldr.f32 s1, [r3]        @ load operand2 in S1
    vadd.f32 s0, s0, s1      @ add operand2 to operand1
    ldr      r3, =sum        @ load address of sum
    vstr.f32 s0, [r3]        @ store the result in sum
    mov      r7, #1          @ exit system call
    svc      0
operand1: .float 32.5
operand2: .float 23.4
    .data
sum: .space 4
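A C counterpart of this program helps show what to expect from the stored result: 32.5 is exactly representable in single precision, but 23.4 is not, so the sum is only close to 55.9. The sketch below assumes IEEE 754 float and double types; both function names are hypothetical.

```c
/* Single-precision addition, the C counterpart of vadd.f32. */
float add_single(float a, float b)
{
    return a + b;
}

/* Returns 1 if x survives a round trip through single precision
   unchanged, meaning it is exactly representable as a float. */
int is_exact_in_single(double x)
{
    return (double)(float)x == x;
}
```

Here is_exact_in_single(32.5) returns 1 and is_exact_in_single(23.4) returns 0, so a comparison of the sum against 55.9 should allow a small tolerance.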

VMSR and VMRS
VMRS: moves the content of a VFP system register to one of the Arm registers.
Example: moving the NZCV flags to the CPSR:
    VMRS APSR_nzcv, FPSCR
VMSR: moves an Arm register to one of the VFP system registers.
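The flags transferred by VMRS are usually the ones left behind by VCMP. The C sketch below models the NZCV pattern for each comparison outcome, packed into bits 31 to 28 as in the FPSCR. The encodings follow the Arm documentation for floating-point compares as far as we know and should be checked against the reference manual; the function name is hypothetical.

```c
#include <stdint.h>

/* Model of the NZCV flags (bits 31 to 28) after comparing a with b. */
uint32_t model_vcmp_flags(float a, float b)
{
    if (a != a || b != b) {
        return 0x30000000u;   /* unordered (a NaN is involved): C = 1, V = 1 */
    }
    if (a == b) {
        return 0x60000000u;   /* equal: Z = 1, C = 1 */
    }
    if (a < b) {
        return 0x80000000u;   /* less than: N = 1 */
    }
    return 0x20000000u;       /* greater than: C = 1 */
}
```

After VMRS APSR_nzcv, FPSCR the ordinary conditional instructions can act on these flags, for example a branch on "equal" when Z is set.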

VCVT
Converts between types. The first type is the destination type and the second is the source type:
    VCVT.type.type Sd, Sm
Modifier   Type
.f32       Single precision
.f64       Double precision
.u32       32-bit unsigned integer
.s32       32-bit signed integer
Examples:
    vcvt.f32.s32 s1, s0   @ signed int to float
    vcvt.f64.f32 d1, s0   @ float to double
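A conversion changes the bit pattern, whereas moving a value between a CPU register and a VFP register copies the bits unchanged. The C sketch below contrasts the two, assuming IEEE 754 float and in-range values; the function names are hypothetical. The float-to-integer cast truncates toward zero.

```c
#include <stdint.h>
#include <string.h>

/* Bit copy: reinterpret the 32 bits as a float, as a register move does. */
float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Value conversion, like vcvt.f32.s32. */
float int_to_float(int32_t v)
{
    return (float)v;
}

/* Value conversion, like vcvt.s32.f32; the fraction is discarded. */
int32_t float_to_int(float f)
{
    return (int32_t)f;
}

/* Value conversion, like vcvt.f64.f32; always exact. */
double float_to_double(float f)
{
    return (double)f;
}
```

For example, int_to_float(3) is 3.0, while bits_to_float(3) is a tiny denormal number, because the integer bit pattern 0x00000003 is not the encoding of 3.0.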
