1
CSE 551 Computational Methods, 2018/2019 Fall
Chapter 3: Error Analysis and Computer Arithmetic
2
Outline
Base Changes
Introduction to Error Analysis
Floating-Point Representations
3
References
W. Cheney, D. Kincaid, Numerical Mathematics and Computing, 6th ed., Chapters 1 and 2, Appendix B
4
Introduction
We begin with a discussion of general number representation but move quickly to bases 2, 8, and 16, as they are the bases primarily used in computer arithmetic. The familiar decimal notation uses the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. When we write a whole number such as 37294, the individual digits represent coefficients of powers of 10 as follows:
37294 = 4 × 10^0 + 9 × 10^1 + 2 × 10^2 + 7 × 10^3 + 3 × 10^4
5
Thus, in general, a string of digits represents a number according to the formula
a_n a_(n−1) … a_2 a_1 a_0 = a_0 × 10^0 + a_1 × 10^1 + · · · + a_(n−1) × 10^(n−1) + a_n × 10^n
This takes care of only the positive whole numbers. A number between 0 and 1 is represented by a string of digits to the right of a decimal point. For example,
0.7215 = 7 × 10^-1 + 2 × 10^-2 + 1 × 10^-3 + 5 × 10^-4
6
In general, we have the formula
0.d_1 d_2 d_3 … = d_1 × 10^-1 + d_2 × 10^-2 + d_3 × 10^-3 + · · ·
There can be an infinite string of digits to the right of the decimal point; indeed, there must be an infinite string to represent some numbers. For example, 1/3 = 0.3333…
7
For a real number of the form
x = (a_n a_(n−1) … a_0 . b_1 b_2 b_3 …)_β = Σ_{k=0}^{n} a_k β^k + Σ_{k=1}^{∞} b_k β^-k
the integer part is the first summation and the fractional part is the second. A number represented in base β is signified by enclosing it in parentheses and adding a subscript β.
8
Base-β Numbers
Other bases are used, especially in computers: the binary system uses 2 as the base, the octal system uses 8, and the hexadecimal system uses 16. In the octal representation of a number, the digits are 0, 1, 2, 3, 4, 5, 6, 7. For example,
(21467)_8 = 7 × 8^0 + 6 × 8^1 + 4 × 8^2 + 1 × 8^3 + 2 × 8^4 = 7 + 8(6 + 8(4 + 8(1 + 8(2)))) = 9015
A number between 0 and 1, expressed in octal, is represented with combinations of 8^-1, 8^-2, and so on. For example,
(0.36207)_8 = 3 × 8^-1 + 6 × 8^-2 + 2 × 8^-3 + 0 × 8^-4 + 7 × 8^-5 = 8^-5(7 + 8^2(2 + 8(6 + 8(3)))) = 15495/32768 ≈ 0.4728
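The nested (Horner) evaluation used above can be sketched in Python; the function name `octal_to_decimal` is illustrative, not part of the text:

```python
from functools import reduce

def octal_to_decimal(digits):
    # Evaluate d_m*8^m + ... + d_0 via the nested form
    # 7 + 8*(6 + 8*(4 + 8*(1 + 8*(2)))) for (21467)_8.
    return reduce(lambda acc, d: acc * 8 + d, digits, 0)

print(octal_to_decimal([2, 1, 4, 6, 7]))  # (21467)_8 -> 9015
```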
9
If we use another base, say β, then numbers represented in the β-system look like
(a_n a_(n−1) … a_0 . b_1 b_2 …)_β
The digits are 0, 1, …, β−2, β−1. If β > 10, it is necessary to introduce symbols for 10, 11, …, β−1. The separator between the integer and fractional parts is called the radix point; the term decimal point is reserved for base-10 numbers.
10
Conversion of Integer Parts
We now formalize the process of converting a number from one base to another, considering separately the integer and fractional parts. Let N be a positive integer with base γ; to convert it to the number system with base β, write N in its nested form:
N = a_0 + γ(a_1 + γ(a_2 + · · · + γ(a_(m−1) + γ(a_m)) · · ·))
11
Then replace each of the numbers on the right by its representation in base β, and carry out the calculations in β-arithmetic. The replacement of the a_k's and γ by equivalent base-β numbers requires a table showing how each of the numbers 0, 1, …, γ−1 appears in the β-system; a base-β multiplication table may also be required.
12
To convert the decimal number 3781 to binary this way, we would need the decimal-binary equivalences and longhand multiplication in base 2. For hand calculations there is an easier method. Write down an equation containing the unknown digits c_0, c_1, …, c_m:
3781 = (c_m c_(m−1) … c_1 c_0)_2 = c_0 + 2(c_1 + 2(c_2 + · · · + 2(c_m) · · ·))
13
Observe that if N is divided by β, then the remainder in this division is c_0 and the quotient is
c_1 + β(c_2 + · · · + β(c_m) · · ·)
If this number is divided by β, the remainder is c_1, and so on. Thus, we divide repeatedly by β, saving the remainders c_0, c_1, …, c_m and continuing with the quotients.
14
Example: Convert the decimal number 3781 to binary form using the division algorithm.
Solution: Divide repeatedly by 2, saving the remainders:
15
3781 = 2 · 1890 + 1, 1890 = 2 · 945 + 0, 945 = 2 · 472 + 1, and so on, with remainders 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1. The digits c_i are obtained beginning with the digit next to the binary point, so the remainders are read from last to first. Thus, we have
(3781.)_10 = (111011000101.)_2
and not the other way around: (101000110111.)_2 = (2615)_10
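The repeated-division algorithm can be sketched in Python; the function name `to_base` is illustrative:

```python
def to_base(n, beta):
    # Divide repeatedly by beta, saving remainders c0, c1, ..., cm.
    # The remainders come out least-significant digit first.
    digits = []
    while n > 0:
        n, r = divmod(n, beta)
        digits.append(r)
    # Read the remainders from last to first.
    return ''.join(str(d) for d in reversed(digits)) or '0'

print(to_base(3781, 2))  # '111011000101'
```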
16
Example: Convert a binary integer N = (c_m … c_1 c_0)_2 to decimal form by nested multiplication:
N = c_0 + 2(c_1 + 2(c_2 + · · · + 2(c_m) · · ·))
17
Another conversion problem is going from an integer in base γ to an integer in base β while using calculations in base γ. The unknown coefficients are again determined by successive division, but now this arithmetic is carried out in the γ-system. At the end, the numbers c_k are in base γ, and a table of γ-β equivalents is used.
18
For example, we can convert a binary integer into decimal form by repeated division by (1010)_2, which equals (10)_10, carrying out the operations in binary; a table of binary-decimal equivalents then gives the answer. But binary division is easy only for computers, so we develop alternative procedures.
19
Conversion of Fractional Parts
To convert a fractional number such as (0.372)_10 to binary, a direct yet naive approach is to divide the binary representations: (0.372)_10 = (372)_10 / (1000)_10 = (101110100)_2 / (1111101000)_2.
20
Dividing in binary arithmetic is not straightforward, so we look for an easier way to do the conversion. Suppose x is in the range 0 < x < 1 and the digits c_k in the representation
x = (0.c_1 c_2 c_3 …)_β
are to be determined.
21
Observe that multiplying by the base β simply shifts the radix point:
βx = (c_1 . c_2 c_3 …)_β
Thus, the unknown digit c_1 is the integer part of βx, denoted I(βx), and the fractional part (0.c_2 c_3 c_4 …)_β is denoted F(βx). The process is repeated in the same pattern on F(βx) to obtain c_2, and so on; the arithmetic is carried out in the decimal system.
22
Example Use the preceding algorithm to convert the decimal number x = (0.372)10 to binary form.
23
Repeatedly multiply by 2 and remove the integer parts: 0.372 × 2 = 0.744 (digit 0), 0.744 × 2 = 1.488 (digit 1), 0.488 × 2 = 0.976 (digit 0), 0.976 × 2 = 1.952 (digit 1), 0.952 × 2 = 1.904 (digit 1), and so on, giving
(0.372)_10 = (0.01011111001…)_2
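The I/F iteration can be sketched in Python; `Fraction` keeps the decimal arithmetic exact, and the function name `frac_to_base` is illustrative:

```python
from fractions import Fraction

def frac_to_base(x, beta, ndigits):
    # x in (0, 1): each step takes c_k = I(beta*x) and continues with F(beta*x).
    digits = []
    for _ in range(ndigits):
        x *= beta
        c, x = divmod(x, 1)   # integer part I(beta*x), fractional part F(beta*x)
        digits.append(int(c))
    return digits

print(frac_to_base(Fraction(372, 1000), 2, 8))  # [0, 1, 0, 1, 1, 1, 1, 1]
```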
24
Base Conversion 10 ↔ 8 ↔ 2
Most computers use the binary system for the representation of numbers. The octal system (base 8) is useful in converting from the decimal system to the binary system and vice versa. With base 8, the positional values are 8^0 = 1, 8^1 = 8, 8^2 = 64, 8^3 = 512, 8^4 = 4096, …
25
26
When converting between decimal and binary form, it is convenient to use the octal representation as an intermediate step. Conversion between decimal and octal proceeds as before, while conversion between octal and binary is simple: starting at the binary point and proceeding in both directions, each group of three binary digits corresponds to one octal digit. For example, (10 111 001)_2 = (271)_8. Conversion of an octal number to binary is done in a similar manner but in reverse, expanding each octal digit into three bits.
27
Example: What is (2576.35546875)_10 in octal and binary forms?
Solution: Convert the decimal number first to octal and then to binary. For the integer part, repeatedly divide by 8: 2576 = 8 · 322 + 0, 322 = 8 · 40 + 2, 40 = 8 · 5 + 0, 5 = 8 · 0 + 5, so
(2576.)_10 = (5020.)_8 = (101 000 010 000.)_2
28
For the fractional part, repeatedly multiply by 8: 0.35546875 × 8 = 2.84375, 0.84375 × 8 = 6.75, 0.75 × 8 = 6, so
(0.35546875)_10 = (0.266)_8 = (0.010 110 110)_2
and the result is (2576.35546875)_10 = (101000010000.010110110)_2
29
Base 16
The hexadecimal system (base 16) uses the digits 0 through 9 together with A, B, C, D, E, and F, which represent 10, 11, 12, 13, 14, and 15, respectively, giving a simple table of equivalences.
30
Conversion between binary numbers and hexadecimal numbers: regroup the binary digits into groups of four, starting at the binary point. For example,
(10101110101101)_2 = (0010 1011 1010 1101)_2 = (2BAD)_16
and
(111101011110010.110010011110)_2 = (0111 1010 1111 0010 . 1100 1001 1110)_2 = (7AF2.C9E)_16
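The regrouping idea can be sketched in Python, letting `int` and `bin` do the radix work:

```python
# Hex -> binary, then regroup the bits in fours to recover the hex digits.
n = int('2BAD', 16)
bits = bin(n)[2:]
print(bits)                  # '10101110101101'
# Pad on the left to a multiple of 4 and regroup.
bits = bits.zfill((len(bits) + 3) // 4 * 4)
groups = [bits[i:i + 4] for i in range(0, len(bits), 4)]
print(groups)                # ['0010', '1011', '1010', '1101']
hex_digits = ''.join('%X' % int(g, 2) for g in groups)
print(hex_digits)            # '2BAD'
```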
31
More Examples convert (0.276)8, (0.C8)16, and (492)10 into different number systems
33
Significant Digits digits beginning with the leftmost nonzero digit and ending with the rightmost correct digit, including final zeros that are exact.
34
Example: Solve for the variable y in this linear system of equations in two variables:
0.1036 x + 0.2122 y = 0.7381
0.2081 x + 0.4247 y = 0.9327
First, carry only three significant digits of precision in the calculations. Second, repeat with four significant digits throughout. Finally, use ten significant digits.
35
Solution: The first task is to round all numbers in the original problem to three digits and to round all calculations, keeping only three significant digits. We take a multiple α of the first equation and subtract it from the second to eliminate the x-term. The multiplier is α = 0.208/0.104 ≈ 2.00. In the second equation, the new coefficient of the x-term is 0.208 − (2.00)(0.104) ≈ 0.208 − 0.208 = 0, the new y-term coefficient is 0.425 − (2.00)(0.212) ≈ 0.425 − 0.424 = 0.001, and the right-hand side is 0.933 − (2.00)(0.738) = 0.933 − 1.48 = −0.547. Hence y = −0.547/(0.001) ≈ −547.
36
Now keep four significant digits. The multiplier is α = 0.2081/0.1036 ≈ 2.009. In the second equation, the new coefficient of the x-term is 0.2081 − (2.009)(0.1036) ≈ 0.2081 − 0.2081 = 0, the new coefficient of the y-term is 0.4247 − (2.009)(0.2122) ≈ 0.4247 − 0.4263 = −0.0016, and the new right-hand side is 0.9327 − (2.009)(0.7381) ≈ 0.9327 − 1.483 = −0.5503. Hence y = −0.5503/(−0.0016) ≈ 343.9. We are shocked to find that the answer has changed from −547 to 343.9, which is a huge difference!
37
Carrying ten significant decimal digits, we find that even 343.9 is not accurate: the computed value of y changes yet again. The lesson learned: data thought to be accurate should be carried with full precision and not be rounded off prior to each of the calculations.
38
In most computers, the arithmetic operations are carried out in a double-length accumulator with twice the precision of the stored quantities. Even this may not avoid a loss of accuracy! Loss of accuracy from roundoff errors typically arises when subtracting nearly equal numbers.
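The loss of low-order digits can be demonstrated directly in Python (doubles); the specific values here are illustrative:

```python
# Adding a small value to a large one discards the small value's
# low-order digits, so a subsequent subtraction cannot recover them.
small = 0.123456789
recovered = (small + 1e10) - 1e10
print(recovered == small)        # False: low-order digits were lost
print(abs(recovered - small))    # on the order of the spacing (ulp) near 1e10
```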
39
The figure gives a geometric illustration of what can happen when solving two equations in two unknowns. The point of intersection of the two lines is the exact solution; the dotted lines indicate the degree of uncertainty from errors in the measurements or from roundoff errors. Instead of a sharply defined point, we get a small trapezoidal area containing many possible solutions, and if the two lines are nearly parallel, this area of possible solutions can increase dramatically! This distinguishes well-conditioned from ill-conditioned systems of linear equations.
40
In 2D: well-conditioned and ill-conditioned linear systems
41
Errors: Absolute and Relative
Let α and β be two numbers, one of which is regarded as an approximation to the other.
The error of β as an approximation to α is α − β: the error equals the exact value minus the approximate value.
The absolute error of β as an approximation to α is |α − β|.
The relative error of β as an approximation to α is |α − β|/|α|.
In the absolute error, the roles of α and β are the same; in the relative error they are not, and the relative error is undefined in the case α = 0.
42
relative error is usually more meaningful than the absolute error
For example, let α_1 = 1.333, β_1 = 1.334 and α_2 = 0.001, β_2 = 0.002. The absolute error of β_i as an approximation to α_i is the same in both cases: 10^-3. However, the relative errors are (3/4) × 10^-3 and 1, respectively. The relative error clearly indicates that β_1 is a good approximation to α_1 but that β_2 is a poor approximation to α_2.
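The two definitions can be sketched in Python; the function names are illustrative:

```python
def abs_error(alpha, beta):
    return abs(alpha - beta)

def rel_error(alpha, beta):
    # Undefined when alpha == 0; the caller must avoid that case.
    return abs(alpha - beta) / abs(alpha)

# Same absolute error, very different relative errors:
print(abs_error(1.333, 1.334), rel_error(1.333, 1.334))  # ~1e-3, ~7.5e-4
print(abs_error(0.001, 0.002), rel_error(0.001, 0.002))  # ~1e-3, 1.0
```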
43
In summary, the exact value is the true value. A useful way to express the absolute error and relative error is to drop the absolute values:
(relative error)(exact value) = exact value − approximate value
approximate value = (exact value)[1 − (relative error)]
In practice, the relative error is often related to the approximate value rather than to the exact value, since the true value may not be known.
44
Example: Consider x = 0.00347 rounded to x̂ = 0.0035 and y = 30.158 rounded to ŷ = 30.16. What are the numbers of significant digits, the absolute errors, and the relative errors? Interpret the results.
45
Solution
Case 1: x̂ = 0.35 × 10^-2 has two significant digits, absolute error 0.3 × 10^-4, and relative error 0.86 × 10^-2.
Case 2: ŷ = 0.3016 × 10^2 has four significant digits, absolute error 0.2 × 10^-2, and relative error 0.66 × 10^-4.
The relative error is a better indication of the number of significant digits than the absolute error.
46
Accuracy and Precision
Accurate to n decimal places: we can trust n digits to the right of the decimal point.
Accurate to n significant digits: we can trust a total of n digits as being meaningful, beginning with the leftmost nonzero digit.
47
Suppose we use a ruler graduated in millimeters to measure lengths. The measurements will be accurate to one millimeter, that is, 0.001 m: three decimal places when written in meters. A measurement reported to three decimal places is as accurate as the ruler allows; reporting a fourth decimal place would be meaningless, since the ruler produces only three. If a measurement has five dependable digits, it is accurate to five significant figures, whereas a measurement whose leading digits are zeros may have only two significant figures.
48
When using a calculator or computer in a laboratory experiment, one may get a false sense of having higher precision than is warranted by the data. For example, (1.2) + (3.45) = 4.65 has only two significant digits of accuracy, because the second digit in 1.2 may be the effect of rounding 1.24 down or rounding 1.16 up to two significant figures. Then the left-hand side may be as large as (1.249) + (3.454) = (4.703) or as small as (1.16) + (3.449) = (4.609).
49
In Addition and Subtraction
In adding and subtracting numbers the result is accurate only to the smallest number of significant digits used in any step of the calculation In the above example, the term 1.2 has two significant digits; therefore, the final calculation has an uncertainty in the third digit
50
Rule of Thumb
In multiplication and division, the results may be even more misleading. For example, on a calculator, (1.23)(4.5) = 5.535 and (1.23)/(4.5) = 0.273333333. There appear to be four and nine significant digits in the results, but there are really only two. As a rule of thumb, keep as many significant digits in a sequence of calculations as there are in the least accurate number involved in the computations.
51
Rounding and Chopping
Rounding reduces the number of significant digits in a number: the result is a number of similar magnitude with fewer nonzero digits. There are several slightly different rules for rounding. The round-to-even method (statistician's rounding or bankers' rounding) tends to reduce the total rounding error, with on average an equal portion of numbers rounding up as rounding down.
52
A number x is chopped to n digits or figures when all digits that follow the nth digit are discarded and none of the remaining n digits are changed. x is rounded to n digits or figures when x is replaced by an n-digit number that approximates x with minimum error. For an (n+1)-digit decimal number that ends with a 5, the question of whether to round up or down is settled by always selecting the rounded n-digit number with an even nth digit. This may seem strange at first, but round-to-even is the default rounding of decimal calculations in standard floating-point arithmetic!
53
Examples of rounding three-decimal numbers to two digits: 0.217 ≈ 0.22 and 0.475 ≈ 0.48 (the tie goes to the even digit 8); with chopping instead: 0.217 ≈ 0.21 and 0.475 ≈ 0.47. On a computer, one may have the option of performing arithmetic operations with either chopping or rounding; the latter is usually preferable.
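Both modes can be demonstrated with Python's `decimal` module; the tie case 0.365 is an extra illustration of round-to-even, not from the text:

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_DOWN

q = Decimal('0.01')  # round/chop to two decimal digits
# Round-to-even (bankers' rounding):
print(Decimal('0.475').quantize(q, ROUND_HALF_EVEN))  # 0.48 (tie -> even 8)
print(Decimal('0.365').quantize(q, ROUND_HALF_EVEN))  # 0.36 (tie -> even 6)
# Chopping (ROUND_DOWN discards excess digits):
print(Decimal('0.475').quantize(q, ROUND_DOWN))       # 0.47
print(Decimal('0.217').quantize(q, ROUND_DOWN))       # 0.21
```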
54
Floating-Point Representation
In normalized scientific notation, the decimal point is shifted and an appropriate power of 10 is supplied so that the number is represented by a fraction multiplied by 10^n, with a nonzero leading digit in the fraction (except when the number is zero). For example, we write 0.12345 × 10^5, not 0.012345 × 10^6 or 1.2345 × 10^4.
55
Normalized Floating-Point Representation
In computer science, normalized scientific notation is also called normalized floating-point representation. In the decimal system, any real number x (other than zero) can be represented in normalized floating-point form as
x = ±0.d_1 d_2 d_3 … × 10^n, with d_1 ≠ 0
where n is an integer (positive, negative, or zero) and the d_i are the decimal digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
56
Stated another way, the real number x, if different from zero, can be represented in normalized floating-point decimal form as
x = ±r × 10^n
This representation consists of three parts: a sign that is either + or −, a number r in the interval [1/10, 1), and an integer power of 10. The number r is called the normalized mantissa and n the exponent.
57
The floating-point representation in the binary system is similar to that in the decimal system in several ways. If x ≠ 0, it can be written as
x = ±q × 2^m
where the mantissa q is expressed as a sequence of zeros and ones in the form q = (0.b_1 b_2 b_3 …)_2 with b_1 ≠ 0; necessarily, q >= 1/2.
58
Floating-Point Numbers in a Computer
A floating-point number system within a computer is similar to what we have just described, with one important difference: every computer has only a finite word length and a finite total capacity, so only numbers with a finite number of digits can be represented. A number is allotted only one word of storage in single-precision mode (two or more words in double or extended precision), so the degree of precision is strictly limited. Irrational numbers cannot be represented, nor can those rational numbers that do not fit the finite format imposed by the computer. Furthermore, numbers may be either too large or too small to be representable. The real numbers that are representable in a computer are called its machine numbers.
59
Finite vs. Nonterminating Expansions
Any number used in calculations with a computer system must conform to the format of numbers in that system: it must have a finite expansion. Numbers that have a nonterminating expansion cannot be accommodated precisely. Moreover, a number that has a terminating expansion in one base may have a nonterminating expansion in another. For example,
1/10 = (0.1)_10 = (0.0631463146…)_8 = (0.000110011001100…)_2
The important point: most real numbers cannot be represented exactly in a computer.
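This can be checked in Python, where `Fraction(0.1)` recovers the exact value of the stored double:

```python
from fractions import Fraction

# 1/10 has a nonterminating binary expansion, so the stored double
# is only the nearest machine number, not 1/10 itself.
stored = Fraction(0.1)            # exact rational value of the double 0.1
print(stored == Fraction(1, 10))  # False
print(float(stored - Fraction(1, 10)))  # tiny representation error
```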
60
Extreme Example
The effective number system of a computer is not a continuum but a rather peculiar discrete set. As an extreme example, list all the floating-point numbers that can be expressed in the form
x = ±(0.b_1 b_2 b_3)_2 × 2^(±k)
where b_1, b_2, b_3, and k each take the value 0 or 1.
61
Solution: There are two choices for the ±, two choices each for b_1, b_2, and b_3, and three distinct choices for the exponent, so we would expect 2 × 2 × 2 × 2 × 3 = 48 different numbers. However, there is some duplication among the nonnegative numbers.
63
Positive machine numbers in the Example
64
Altogether there are 31 distinct numbers in the system
The positive numbers obtained are shown on a line the numbers are symmetrically but unevenly distributed about zero
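The enumeration can be sketched in Python; the assumption, as in the example, is that the exponent takes the three distinct values 2^-1, 2^0, 2^1:

```python
from fractions import Fraction

# Enumerate x = +/-(0.b1b2b3)_2 * 2^(+/-k) with k in {0, 1}.
values = set()
for sign in (1, -1):
    for b1 in (0, 1):
        for b2 in (0, 1):
            for b3 in (0, 1):
                mantissa = Fraction(b1, 2) + Fraction(b2, 4) + Fraction(b3, 8)
                for exp in (Fraction(1, 2), Fraction(1), Fraction(2)):
                    values.add(sign * mantissa * exp)

print(len(values))                       # 31 distinct machine numbers
print(min(v for v in values if v > 0))   # 1/16, the smallest positive
print(max(values))                       # 7/4, the largest
```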
65
Overflow / Underflow
If, in the course of a computation, a number x = ±q × 2^m is produced where the exponent m is outside the computer's permissible range, we say that an overflow or an underflow has occurred, or that x is outside the range of the computer. Generally, an overflow results in a fatal error (or exception) and the normal execution of the program stops, whereas an underflow is usually treated automatically by setting x to zero, without any interruption of the program but with a warning message on most computers.
66
In a computer whose floating-point numbers are restricted to the form in the Example, any nonzero number closer to zero than 1/16 would underflow to zero, and any number outside the range −1.75 to 1.75 would overflow to machine infinity.
67
Hole at Zero
With normalized floating-point numbers in the example, all numbers, with the exception of zero, have the form x = ±(0.1b_2 b_3)_2 × 2^(±k). This creates a phenomenon known as the hole at zero: the nonnegative machine numbers are now distributed as in the following figure, with a relatively wide gap between zero and the smallest positive machine number, which is (0.100)_2 × 2^-1 = 1/4.
68
Normalized machine numbers in the Example
69
Floating-Point Representation in Computers
A computer operates in floating-point mode, except for the limitations imposed by the finite word length. Many binary computers have a word length of 32 bits; the internal representation and storage of numbers then follow a standard floating-point form.
70
Single-precision floating-point numbers
Consider all acceptable numbers in a computer using the standard single-precision floating-point arithmetic format, assuming the computer stores these numbers in 32-bit words. This set is a finite subset of the real numbers. It consists of ±0, ±∞, and the normal and subnormal single-precision floating-point numbers, but not the Not-a-Number (NaN) values. Most real numbers cannot be represented exactly as floating-point numbers, since they have infinite decimal or binary expansions: all irrational numbers and some rational numbers, e.g., π, e, 1/3, 0.1.
71
With a 32-bit word length, as much as possible of the normalized floating-point number ±q × 2^m must be contained in those 32 bits. One way of allocating the 32 bits:
sign of q: 1 bit
integer |m|: 8 bits
number q: 23 bits
Information on the sign of m is contained in the eight bits allocated for the integer |m|. This lets us represent real numbers with |m| as large as 2^7 − 1 = 127; with an excess code, the exponent represents numbers from −127 through 128.
72
Single-Precision Floating-Point Form
We describe a machine number of the following form in standard single-precision floating-point representation:
(−1)^s × 2^(c−127) × (1.f)_2
The leftmost bit is used for the sign of the mantissa: s = 0 corresponds to + and s = 1 corresponds to −. The next eight bits represent the number c in the exponent 2^(c−127), interpreted as an excess-127 code. The last 23 bits represent f, the fractional part of the mantissa in the 1-plus form (1.f)_2.
73
In the normalized representation of a nonzero floating-point number, the first bit in the mantissa is always 1, so this bit does not have to be stored. This is achieved by shifting the binary point to the "1-plus" form (1.f)_2. The rightmost 23 bits contain f with an understood binary point, so the mantissa (significand) actually corresponds to 24 binary digits: there is a hidden bit. An important exception is the number ±0.
74
Procedure for determining the representation of a real number x: If x is zero, it is represented by a full word of zero bits, with the possible exception of the sign bit. For a nonzero x, first assign the sign bit and consider |x|. Then convert both the integer and fractional parts of |x| from decimal to binary. Next, one-plus normalize (|x|)_2 by shifting the binary point so that the first bit to the left of the binary point is a 1 and all bits to its left are 0. To compensate for this shift of the binary point, adjust the exponent of 2, that is, multiply by the appropriate power of 2. This yields the 24-bit one-plus-normalized mantissa in binary.
75
Now the current exponent of 2 is set equal to c − 127 to determine c, which is then converted from decimal to binary. The sign bit of the mantissa is combined with (c)_2 and (f)_2. Finally, the 32-bit representation of x can be written as eight hexadecimal digits.
76
Partitioned floating-point single-precision computer word
77
The value of c in the single-precision representation of a floating-point number is restricted by the inequality
0 < c < (11111111)_2 = 255
The values 0 and 255 are reserved for special cases, including ±0 and ±∞, respectively. Hence, the actual exponent of the number is restricted by the inequality
−126 <= c − 127 <= 127
Likewise, the mantissa of each nonzero number is restricted by the inequality
1 <= (1.f)_2 <= (1.11111111111111111111111)_2 = 2 − 2^-23
78
The largest representable number is
(2 − 2^-23) × 2^127 ≈ 2^128 ≈ 3.4 × 10^38
The smallest positive number is 2^-126 ≈ 1.2 × 10^-38. The binary machine floating-point number ε = 2^-23 is called the machine epsilon in single precision: it is the smallest positive machine number ε such that 1 + ε ≠ 1 in machine arithmetic. Since 2^-23 ≈ 1.2 × 10^-7, in a simple computation approximately six significant decimal digits of accuracy may be obtained in single precision. Recall that 23 bits are allocated for the mantissa.
79
Double-Precision Floating-Point Form
Each double-precision floating-point number is stored in two computer words in memory, with 52 bits allocated for the mantissa. The double-precision machine epsilon is 2^-52 ≈ 2.2 × 10^-16, giving approximately 15 significant decimal digits of precision. There are 11 bits for the exponent, which is biased by 1023 and represents numbers from −1022 through 1023. A machine number in standard double-precision floating-point form is
(−1)^s × 2^(c−1023) × (1.f)_2
80
The leftmost bit is used for the sign of the mantissa: s = 0 corresponds to + and s = 1 to −. The next eleven bits represent the exponent c corresponding to 2^(c−1023), and 52 bits represent f, the fractional part of the mantissa in the one-plus form (1.f)_2. The value of c for a floating-point number in double precision is restricted by the inequality
0 < c < (11111111111)_2 = 2047
with the values at the ends of this interval reserved for special cases. Hence, the actual exponent is restricted by the inequality
−1022 <= c − 1023 <= 1023
81
The mantissa of each nonzero number is restricted by
1 <= (1.f)_2 <= (1.1111…1)_2 = 2 − 2^-52
Since 2^-52 ≈ 2.2 × 10^-16, in a simple computation approximately 15 significant decimal digits of accuracy may be obtained in double precision (52 bits are allocated for the mantissa). The largest double-precision machine number is
(2 − 2^-52) × 2^1023 ≈ 2^1024 ≈ 1.8 × 10^308
The smallest positive double-precision machine number is 2^-1022 ≈ 2.2 × 10^-308.
82
Single precision on a 64-bit computer is comparable to double precision on a 32-bit computer, whereas double precision on a 64-bit computer gives four times the precision available on a 32-bit computer. In single precision, 31 bits are available for an integer, since only 1 bit is needed for the sign; this gives a range from −(2^31 − 1) to 2^31 − 1 = 2147483647. In double precision, 63 bits are available, giving integers in the range −(2^63 − 1) to 2^63 − 1. In integer arithmetic, accurate calculations can therefore produce only approximately nine digits in single precision and 18 digits in double precision! For high accuracy, most computations should be done using double-precision floating-point arithmetic.
83
Example: Determine the machine representation of the decimal number −52.234375 in both single precision and double precision.
84
Solution: Converting the integer part to binary,
(52.)_10 = (64.)_8 = (110100.)_2
and converting the fractional part,
(0.234375)_10 = (0.17)_8 = (0.001111)_2
so
(52.234375)_10 = (110100.001111)_2 = (1.10100001111)_2 × 2^5
in one-plus form in base 2, and (10100001111 0…0)_2 is the stored mantissa. The exponent is (5)_10; since c − 127 = 5, the stored exponent is c = (132)_10 = (204)_8 = (10000100)_2. The single-precision machine representation of −52.234375 is therefore
[1 10000100 10100001111000000000000]_2 = [1100 0010 0101 0000 1111 0000 0000 0000]_2 = [C250F000]_16
85
In double precision, for the exponent (5)_10 we have c − 1023 = 5, so the stored exponent is c = (1028)_10 = (2004)_8 = (10000000100)_2. The double-precision machine representation of −52.234375 is
[1 10000000100 1010000111100 · · · 00]_2 = [1100 0000 0100 1010 0001 1110 0000 · · · 0000]_2 = [C04A1E0000000000]_16
Here [· · ·]_k denotes the bit pattern of the machine word(s) that represents a floating-point number, displayed in base k.
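The hand-computed representations can be cross-checked in Python with the `struct` module, which packs IEEE-style single- and double-precision values:

```python
import struct

x = -52.234375
# Big-endian single- and double-precision encodings of x, as hex digits:
single = struct.pack('>f', x).hex().upper()
double = struct.pack('>d', x).hex().upper()
print(single)  # 'C250F000'
print(double)  # 'C04A1E0000000000'
```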
86
Example Determine the decimal numbers that correspond to these machine words: [45DE4000]16 [BA390000]16
87
Solution: The first word in binary is
[0100 0101 1101 1110 0100 0000 0000 0000]_2
The stored exponent is (10001011)_2 = (213)_8 = (139)_10, so the exponent is 139 − 127 = 12. The mantissa is positive, and the number represented is
(1.10111100100…)_2 × 2^12 = (1101111001000.)_2 = (15710.)_8 = 0 × 8^0 + 1 × 8^1 + 7 × 8^2 + 5 × 8^3 + 1 × 8^4 = 8(1 + 8(7 + 8(5 + 8(1)))) = 7112
88
The second word in binary is
[1011 1010 0011 1001 0000 0000 0000 0000]_2
The exponent field is (01110100)_2 = (164)_8 = (116)_10, so the exponent is 116 − 127 = −11. The mantissa is negative, and the number represented is
−(1.0111001)_2 × 2^-11 = −(0.000000000010111001)_2 = −(0.000271)_8 = −2 × 8^-4 − 7 × 8^-5 − 1 × 8^-6 = −8^-6(1 + 8(7 + 8(2))) = −185/262144 ≈ −7.0572 × 10^-4
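The decoding direction can likewise be checked with `struct`:

```python
import struct

# Decode the two machine words from the example:
v1 = struct.unpack('>f', bytes.fromhex('45DE4000'))[0]
v2 = struct.unpack('>f', bytes.fromhex('BA390000'))[0]
print(v1)                   # 7112.0
print(v2 == -185 / 262144)  # True (about -7.0572e-4)
```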
89
Computer Errors in Representing Numbers
Errors can occur when we represent a real number x in the computer. Consider a model computer with a 32-bit word length. For x = 2^32591 or x = 2^-32591, the exponents far exceed the limitations of the machine: overflow and underflow, respectively. The relative error in replacing such an x by the closest machine number is very large; these numbers are outside the range of a 32-bit word-length computer.
90
Consider next a positive real number x within range, in normalized floating-point form:
x = q × 2^m, 1/2 <= q < 1, −126 <= m <= 127
The process of replacing x by its nearest machine number is called correct rounding, and the error involved is the roundoff error. Express q in normalized binary notation:
x = (0.1 b_2 b_3 … b_24 b_25 b_26 …)_2 × 2^m
One nearby machine number is obtained by rounding down, i.e., simply dropping the excess bits b_25 b_26 …, since only 23 bits are allocated for the stored mantissa. This machine number,
x_− = (0.1 b_2 b_3 … b_24)_2 × 2^m
lies to the left of x on the real-number axis.
91
Another machine number, x_+, lies to the right of x on the real axis and is obtained by rounding up, adding one unit to b_24 in the expression for x_−:
x_+ = [(0.1 b_2 b_3 … b_24)_2 + 2^-24] × 2^m
The closer of these machine numbers is the one chosen to represent x. If x lies closer to x_−, then
|x − x_−| <= (1/2)|x_+ − x_−| = 2^(m−25)
and the relative error is bounded:
|x − x_−|/|x| <= 2^(m−25)/(q × 2^m) <= 2^-25/(1/2) = 2^-24 = u
92
Here u = 2^-24 is the unit roundoff error for a 32-bit binary computer with standard floating-point arithmetic. Recall that the machine epsilon is ε = 2^-23, so u = (1/2)ε. In general, u = 2^-k, where k is the number of binary digits used in the mantissa, including the hidden bit (k = 24 in single precision and k = 53 in double precision).
93
On the other hand, if x lies closer to x_+ than to x_−, then
|x − x_+| <= (1/2)|x_+ − x_−|
and the same analysis shows that the relative error is no greater than 2^-24 = u. So in the case of rounding to the nearest machine number, the relative error is bounded by u.
94
Chopping
When all excess digits or bits are simply discarded, the process is called chopping. If a 32-bit word-length computer has been designed to chop numbers, the relative error bound is twice as large as above: 2u = 2^-23 = ε.
95
A possible relationship between x_−, x_+, and x
96
Notation fl(x) and Backward Error Analysis
We now consider the errors produced by elementary arithmetic operations. To illustrate, take a five-place decimal machine and add two numbers in normalized floating-point form, one of order 10^4 and the other of order 10^-1. Many computers perform arithmetic operations in a double-length work area, so assume that our computer has a ten-place accumulator. First, the exponent of the smaller number is adjusted so that both exponents are the same; then the numbers are added in the accumulator, and the rounded result is placed in a computer word.
97
In the example, the nearest machine number is z = 0.37219 × 10^4, and the relative error in it is of the order 10^-5, which would be regarded as acceptable on a machine of such low precision.
98
We introduce the notation fl(x) to denote the floating-point machine number that corresponds to the real number x. The function fl depends on the particular computer involved; on the hypothetical five-decimal-digit machine above, fl rounds its argument to five significant decimal digits. For a 32-bit word-length computer, x can be any real number within the range of the computer.
99
Assume that correct rounding is used. Then the relative error satisfies |fl(x) − x|/|x| <= 2^-24. This inequality can be expressed in the form
fl(x) = x(1 + δ), |δ| <= 2^-24
The two statements are equivalent: let δ = [fl(x) − x]/x, so that |δ| <= 2^-24, and solve for fl(x) to obtain fl(x) = x(1 + δ).
100
Consider the details in the addition 1 + ε: if ε >= 2^-23, then fl(1 + ε) > 1, while if ε < 2^-23, then fl(1 + ε) = 1. Consequently, if the machine epsilon is the smallest positive machine number ε such that fl(1 + ε) > 1, then ε = 2^-23.
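The same criterion can be applied in Python, whose floats are double precision, so the loop finds 2^-52 rather than the text's single-precision 2^-23:

```python
# Find the machine epsilon of double-precision floats:
# the smallest power of two eps with fl(1 + eps) > 1.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2
print(eps)                # 2**-52, about 2.22e-16
print(eps == 2.0 ** -52)  # True
```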
101
Let the symbol ⊙ denote any of the arithmetic operations +, −, ×, ÷. Suppose a 32-bit word-length computer is designed so that, for two machine numbers x and y, it produces fl(x ⊙ y) instead of x ⊙ y: imagine that x ⊙ y is first correctly formed, then normalized, and finally rounded to become a machine number. Then the relative error does not exceed 2^-24:
fl(x ⊙ y) = (x ⊙ y)(1 + δ), |δ| <= 2^-24
102
Special cases: δ is a variable satisfying −2^-24 <= δ <= 2^-24. The assumptions about our model 32-bit word-length computer are not quite true when x ⊙ y overflows or underflows.
103
The equation above can be written in a variety of ways, leading to alternative interpretations of roundoff. For example,
fl(x + y) = (x + y)(1 + δ) = x(1 + δ) + y(1 + δ)
The result of adding x and y is not, in general, x + y, but it is the true sum of x(1 + δ) and y(1 + δ), where x(1 + δ) is the result of slightly perturbing x. Thus, the machine version of x + y, namely fl(x + y), is the exact sum of a slightly perturbed x and a slightly perturbed y.
104
Backward vs. Direct Error Analysis
Backward error analysis attempts to determine what perturbation of the original data would cause the computed results to be the exact results for the perturbed problem. Direct error analysis asks how the computed answers differ from the exact answers based on the same data.
105
Example If x, y, and z are machine numbers in a 32-bit word-length computer what upper bound can be given for the relative roundoff error in computing z(x + y)?
106
Solution: In the computer, x + y will be formed first. This arithmetic operation produces the machine number fl(x + y), which differs from x + y because of roundoff: there is a δ_1 such that
fl(x + y) = (x + y)(1 + δ_1), |δ_1| <= 2^-24
Since z is a machine number, when it multiplies the machine number fl(x + y), the result is the machine number fl[z fl(x + y)], which differs from its exact counterpart: for some δ_2,
fl[z fl(x + y)] = z fl(x + y)(1 + δ_2), |δ_2| <= 2^-24
107
Combining these,
fl[z fl(x + y)] = z(x + y)(1 + δ_1)(1 + δ_2) = z(x + y)(1 + δ_1 + δ_2 + δ_1 δ_2)
Since |δ_1 δ_2| <= 2^-48, we can ignore that term. With δ = δ_1 + δ_2,
|δ| = |δ_1 + δ_2| <= |δ_1| + |δ_2| <= 2^-24 + 2^-24 = 2^-23
so the relative roundoff error is bounded by 2^-23.
108
Example: Critique the following attempt to estimate the relative roundoff error in computing the sum of two real numbers, x and y. In a 32-bit word-length computer, the calculation supposedly yields
fl(fl(x) + fl(y)) = [x(1 + δ) + y(1 + δ)](1 + δ) = (x + y)(1 + δ)^2 ≈ (x + y)(1 + 2δ)
Therefore, the relative error is bounded by 2|δ| <= 2^-23. Why is this calculation not correct?
109
The flaw is that the quantities δ occurring in such calculations are not, in general, equal to each other. The correct calculation is
fl(fl(x) + fl(y)) = [x(1 + δ_1) + y(1 + δ_2)](1 + δ_3), with |δ_i| <= 2^-24
110
The relative roundoff error is then
|fl(fl(x) + fl(y)) − (x + y)| / |x + y| ≈ |δ_3 + (x δ_1 + y δ_2)/(x + y)|
111
This cannot be bounded, because the second term has a denominator that can be zero or close to zero. Of course, if x and y are machine numbers, then δ_1 and δ_2 are zero and a useful bound results, namely |δ_3|. But we do not need this calculation to know that: it has already been assumed that when machine numbers are combined with any of the four arithmetic operations, the relative roundoff error does not exceed 2^-24 in magnitude.