© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012, University of Illinois, Urbana-Champaign. CSCE 4643 GPU Programming, Lecture 8: Floating Point Considerations.

Presentation transcript:

CSCE 4643 GPU Programming, Lecture 8: Floating Point Considerations
© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign

Objective

To understand the fundamentals of floating-point representation, and to know the IEEE-754 Floating Point Standard. CUDA GPU floating-point speed, accuracy, and precision:
– Causes of error
– Algorithm considerations
– Deviations from IEEE-754
– Accuracy of device runtime functions
– The -use_fast_math compiler option
– Future performance considerations

What is the IEEE floating-point format?

A floating point binary number consists of three parts:
– sign (S), exponent (E), and mantissa (M)
– Each (S, E, M) pattern uniquely identifies a floating point number.

For each bit pattern, its IEEE floating-point value is derived as:
– value = (-1)^S * M * 2^E, where 1.0_B ≤ M < 10.0_B

The interpretation of S is simple: S=0 results in a positive number and S=1 in a negative number.
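As a concrete illustration, the minimal host-side sketch below pulls the three fields out of a 32-bit float. The field widths (1/8/23) and the bias (127) are from the IEEE-754 standard; the variable names are ours:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        float f = -0.5f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);       /* reinterpret without aliasing issues */

        uint32_t S = bits >> 31;              /* 1 sign bit */
        uint32_t E = (bits >> 23) & 0xFF;     /* 8 exponent bits (biased) */
        uint32_t M = bits & 0x7FFFFF;         /* 23 fraction bits */

        /* value = (-1)^S * 1.M * 2^(E - 127) for normalized numbers */
        printf("S=%u  E=%u (unbiased %d)  M=0x%06X\n", S, E, (int)E - 127, M);
        return 0;
    }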

Normalized Representation

Specifying that 1.0_B ≤ M < 10.0_B makes the mantissa value for each floating point number unique.
– For example, the only mantissa value allowed for 0.5_D is M = 1.0_B: 0.5_D = 1.0_B * 2^-1
– Neither 10.0_B * 2^-2 nor 0.1_B * 2^0 qualifies.

Because all mantissa values are of the form 1.XX…, one can omit the "1." part from the representation.
– The mantissa value of 0.5_D in a 2-bit mantissa field is 00, which is derived by omitting the "1." from 1.00.
– The mantissa without the implied 1 is called the fraction.

Exponent Representation

In an n-bit exponent representation, 2^(n-1) - 1 is added to the exponent's 2's complement representation to form its excess representation. The table below shows a 3-bit exponent representation (excess-3).

A simple unsigned integer comparator can then be used to compare the magnitude of two FP numbers. The range is symmetric for +/- exponents (111 is reserved).

2's complement   Actual decimal      Excess-3
000              0                   011
001              1                   100
010              2                   101
011              3                   110
100              (reserved pattern)  111
101              -3                  000
110              -2                  001
111              -1                  010
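A small sketch of the excess-encoding rule (the function names are ours, not from the slides):

    /* bias = 2^(n-1) - 1, so n = 3 gives bias 3 and n = 8 gives 127 */
    unsigned encode_excess(int e, int n) { return (unsigned)(e + (1 << (n - 1)) - 1); }
    int      decode_excess(unsigned x, int n) { return (int)x - ((1 << (n - 1)) - 1); }
    /* Because larger exponents map to larger unsigned codes, two biased
       exponents can be compared with a plain unsigned integer comparator. */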

A Simple, Hypothetical 5-bit FP Format

Assume a 1-bit S, 2-bit E, and 2-bit M:
– 0.5_D = 1.00_B * 2^-1
– 0.5_D = 0 00 00, where S = 0, E = 00, and M = (1.)00

2's complement   Actual decimal      Excess-1
00               0                   01
01               1                   10
10               (reserved pattern)  11
11               -1                  00

Representable Numbers

The representable numbers of a given format are the set of all numbers that can be exactly represented in the format. For example, an unsigned 3-bit integer format can represent exactly the integers 0 through 7.

Representable Numbers of a 5-bit Hypothetical IEEE Format

(± condenses the S=0 and S=1 columns of the original table.)

E    M    No-zero               Abrupt underflow      Gradual underflow
00   00   ±2^-1                 0                     0
00   01   ±(2^-1 + 1*2^-3)      0                     ±1*2^-2
00   10   ±(2^-1 + 2*2^-3)      0                     ±2*2^-2
00   11   ±(2^-1 + 3*2^-3)      0                     ±3*2^-2
01   00   ±2^0                  ±2^0                  ±2^0
01   01   ±(2^0 + 1*2^-2)       ±(2^0 + 1*2^-2)       ±(2^0 + 1*2^-2)
01   10   ±(2^0 + 2*2^-2)       ±(2^0 + 2*2^-2)       ±(2^0 + 2*2^-2)
01   11   ±(2^0 + 3*2^-2)       ±(2^0 + 3*2^-2)       ±(2^0 + 3*2^-2)
10   00   ±2^1                  ±2^1                  ±2^1
10   01   ±(2^1 + 1*2^-1)       ±(2^1 + 1*2^-1)       ±(2^1 + 1*2^-1)
10   10   ±(2^1 + 2*2^-1)       ±(2^1 + 2*2^-1)       ±(2^1 + 2*2^-1)
10   11   ±(2^1 + 3*2^-1)       ±(2^1 + 3*2^-1)       ±(2^1 + 3*2^-1)
11   --   Reserved pattern

Note: the no-zero convention cannot represent zero!

Flush to Zero

Treat all bit patterns with E=0 as 0.0.
– This takes away several representable numbers near zero and lumps them all into 0.0.
– For a representation with a large M field, a large number of representable numbers will be removed.

Flush to Zero

(The table of representable numbers is the same as the one above: under flush to zero, all E=0 patterns collapse to 0, while the denormalized column keeps them as 1*2^-2, 2*2^-2, and 3*2^-2.)

Why is flushing to zero problematic?

Many physical model calculations work on values that are very close to zero:
– Dark (but not totally black) sky in movie rendering
– Small distance fields in electrostatic potential calculation
– …

Without denormalized numbers, these calculations tend to create artifacts that compromise the integrity of the models.

Denormalized Numbers

The actual method adopted by the IEEE standard is called denormalized numbers, or gradual underflow.
– The method relaxes the normalization requirement for numbers very close to 0.
– Whenever E=0, the mantissa is no longer assumed to be of the form 1.XX. Rather, it is assumed to be 0.XX.
– In general, if the n-bit exponent field is 0, the value is 0.M * 2^(-2^(n-1) + 2).
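To make this concrete, here is a hypothetical decoder for the 5-bit format with gradual underflow. decode5 and its layout are our own illustration, not code from the slides; note that for n = 2 exponent bits the denormal scale 2^(-2^(n-1) + 2) works out to 2^0:

    #include <stdio.h>
    #include <math.h>

    float decode5(unsigned bits) {
        unsigned S = (bits >> 4) & 1;
        unsigned E = (bits >> 2) & 3;
        unsigned M = bits & 3;
        float sign = S ? -1.0f : 1.0f;
        if (E == 3) return M ? NAN : sign * INFINITY;     /* reserved pattern */
        if (E == 0) return sign * (M / 4.0f);             /* denormalized: 0.MM * 2^0 */
        return sign * ldexpf(1.0f + M / 4.0f, (int)E - 1); /* normalized, excess-1 */
    }

    int main(void) {
        for (unsigned b = 0; b < 16; ++b)                 /* the S = 0 half of the table */
            printf("E=%u M=%u -> %g\n", (b >> 2) & 3, b & 3, decode5(b));
        return 0;
    }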

Denormalization

(Same table as above, now highlighting the gradual-underflow column: the E=0 rows become 0, 1*2^-2, 2*2^-2, and 3*2^-2, spread evenly between 0 and 2^-1, instead of collapsing to zero.)

IEEE 754 Format and Precision

Single precision
– 1-bit sign, 8-bit exponent (excess-127), 23-bit fraction

Double precision
– 1-bit sign, 11-bit exponent (excess-1023), 52-bit fraction
– With 29 more fraction bits, the largest error in representing a number is reduced to 1/2^29 of that of the single precision representation.
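A quick host-side demonstration of the precision gap, using the familiar fact that 0.1 is not exactly representable in binary:

    #include <stdio.h>

    int main(void) {
        /* The stored value first deviates around the 23rd fraction bit for
           float and around the 52nd for double. */
        printf("%.20f\n", (double)0.1f);  /* prints 0.10000000149011611938... */
        printf("%.20f\n", 0.1);           /* prints 0.10000000000000000555... */
        return 0;
    }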

Special Bit Patterns

exponent   mantissa   meaning
11…1       ≠ 0        NaN
11…1       = 0        (-1)^S * ∞
00…0       ≠ 0        denormalized
00…0       = 0        0

An ∞ can be created by overflow, e.g., by dividing a large number by a very small number.
– Any representable number divided by +∞ or -∞ results in 0.

NaN (Not a Number) is generated by operations whose input values do not make sense, for example 0/0, 0*∞, ∞/∞, ∞ - ∞.
– NaN is also used for data that have not been properly initialized in a program.
– Signaling NaNs (SNaNs) are represented with the most significant mantissa bit cleared, whereas quiet NaNs are represented with the most significant mantissa bit set.
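A sketch that mirrors the table above by classifying a float from its raw bits (classify is our own name; the C library's fpclassify() does the same job):

    #include <stdint.h>
    #include <string.h>

    const char *classify(float f) {
        uint32_t b; memcpy(&b, &f, sizeof b);
        uint32_t E = (b >> 23) & 0xFF, M = b & 0x7FFFFF;
        if (E == 0xFF) return M ? "NaN" : "infinity";
        if (E == 0)    return M ? "denormalized" : "zero";
        return "normalized";
    }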

Floating Point Accuracy and Rounding

The accuracy of a floating point arithmetic operation is measured by the maximal error introduced by the operation. The most common source of error in floating point arithmetic is an operation that generates a result which cannot be exactly represented and thus requires rounding. Rounding occurs when the mantissa of the result needs too many bits to be represented exactly.

Rounding and Error

Assume our 5-bit representation, and consider

  1.0 * 2^-2  (0, 00, 01)   [denormalized]
+ 1.00 * 2^1  (0, 10, 00)

The hardware needs to shift the mantissa bits in order to align bits of equal place value with each other:

  0.001 * 2^1 (0, 00, 01)
+ 1.000 * 2^1 (0, 10, 00)

The ideal result would be 1.001 * 2^1 (0, 10, 001), but this would require 3 mantissa bits!

Rounding and Error

In some cases, the hardware may only perform the operation on a limited number of bits for speed and area cost reasons.
– An adder may only have 3 bit positions in our example, so the first operand would be treated as 0.00 * 2^1:

  0.00 * 2^1 (0, 00, 00)
+ 1.00 * 2^1 (0, 10, 00)

Error Measure

If a hardware adder has at least two more bit positions than the total (both implicit and explicit) number of mantissa bits, the error would never be more than half of the place value of the last mantissa bit position:
– 0.001 in our 5-bit format

We refer to this as 0.5 ULP (Units in the Last Place).
– If the hardware is designed to perform arithmetic and rounding operations perfectly, the most error that one should introduce is no more than 0.5 ULP.
– In this case, the error is limited only by the precision of the format.

Order of Operations Matters

Floating point operations are not strictly associative. The root cause is that sometimes a very small number can disappear when added to or subtracted from a very large number.
– (Large + Small) + Small ≠ Large + (Small + Small)
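A minimal host-side demonstration, with values chosen so that the small addend falls below float's 8-unit grain at 1e8 (a float ULP at 1e8 is 2^3 = 8):

    #include <stdio.h>

    int main(void) {
        float large = 1.0e8f, small = 4.0f;
        float a = (large + small) + small;  /* small vanishes twice: 100000000.0 */
        float b = large + (small + small);  /* 8.0 survives the grain: 100000008.0 */
        printf("%.1f vs %.1f\n", a, b);
        return 0;
    }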

Algorithm Considerations

Sequential sum (assuming a 3-bit adder):
1.00*2^0 + 1.00*2^0 + 1.00*2^-2 + 1.00*2^-2
= 1.00*2^1 + 1.00*2^-2 + 1.00*2^-2
= 1.00*2^1 + 1.00*2^-2
= 1.00*2^1

Parallel reduction (assuming a 3-bit adder):
(1.00*2^0 + 1.00*2^0) + (1.00*2^-2 + 1.00*2^-2)
= 1.00*2^1 + 1.00*2^-1
= 1.01*2^1
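Beyond reordering into a tree, compensated (Kahan) summation is one standard way to keep small addends from being dropped. A minimal sketch follows (not from the slides; note that fast-math compiler flags may optimize the compensation away):

    #include <stdio.h>

    float kahan_sum(const float *x, int n) {
        float sum = 0.0f, c = 0.0f;      /* c carries the lost low-order bits */
        for (int i = 0; i < n; ++i) {
            float y = x[i] - c;          /* compensate before adding */
            float t = sum + y;           /* big + small: low bits of y may be lost */
            c = (t - sum) - y;           /* recover exactly what was lost */
            sum = t;
        }
        return sum;
    }

    int main(void) {
        float x[] = {1.0f, 1.0f, 0.25f, 0.25f};  /* mirrors the slide's example */
        printf("%f\n", kahan_sum(x, 4));         /* 2.5, like the parallel reduction */
        return 0;
    }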

Runtime Math Library

There are two types of runtime math operations:
– __func(): direct mapping to the hardware ISA. Fast but lower accuracy (see the programming guide for details). Examples: __sinf(x), __expf(x), __powf(x,y)
– func(): compiles to multiple instructions. Slower but higher accuracy (5 ulp, units in the last place, or less). Examples: sinf(x), expf(x), powf(x,y)

The -use_fast_math compiler option forces every func() to compile to __func().
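A small CUDA sketch contrasting the two flavors (the kernel and variable names are ours):

    #include <cstdio>

    // sinf() is the accurate multi-instruction version; __sinf() maps to the
    // fast hardware instruction and is less accurate, especially for large |x|.
    __global__ void compare(float x, float *out) {
        out[0] = sinf(x);     // slower, higher accuracy
        out[1] = __sinf(x);   // fast hardware intrinsic
    }

    int main() {
        float *d;
        cudaMalloc((void **)&d, 2 * sizeof(float));
        compare<<<1, 1>>>(3.14159f, d);
        float h[2];
        cudaMemcpy(h, d, sizeof h, cudaMemcpyDeviceToHost);
        printf("sinf=%g  __sinf=%g\n", h[0], h[1]);
        cudaFree(d);
        return 0;
    }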

Make Your Program Float-Safe!

GPU hardware has double precision support.
– Double precision comes with an additional performance cost.
– Careless use of double or undeclared types may run more slowly.

It is important to be float-safe (be explicit whenever you want single precision) to avoid using double precision where it is not needed.
– Add the 'f' suffix to float literals:
    foo = bar * 0.123;   // double assumed
    foo = bar * 0.123f;  // float explicit
– Use the float version of standard library functions:
    foo = sin(bar);      // double assumed
    foo = sinf(bar);     // single precision explicit

Deviations from IEEE-754

Addition and multiplication are IEEE 754 compliant.
– Maximum 0.5 ulp (units in the last place) error

Division is non-compliant (2 ulp). Not all rounding modes are supported, and there is no mechanism to detect floating-point exceptions.

GPU Floating Point Features

Feature                         Fermi                   SSE                     IBM Altivec            Cell SPE
Precision                       IEEE 754                IEEE 754                IEEE 754               IEEE 754
Rounding modes (FADD, FMUL)     Round to nearest        All 4 IEEE: nearest,    Round to nearest       Round to zero /
                                and round to zero       zero, +inf, -inf        only                   truncate only
Denormal handling               Supported               Supported,              Supported,             Flush to zero
                                                        1000's of cycles        1000's of cycles
NaN support                     Yes                     Yes                     Yes                    No
Overflow and infinity support   Yes, only clamps        Yes                     Yes                    No, infinity
                                to max norm
Flags                           No                      Yes                     Yes                    Some
Square root                     Software only           Hardware                Software only          Software only
Division                        Software only           Hardware                Software only          Software only
Reciprocal estimate accuracy    24 bit                  12 bit                  12 bit                 12 bit
Reciprocal sqrt estimate acc.   23 bit                  12 bit                  12 bit                 12 bit
log2(x) and 2^x estimate acc.   23 bit                  No                      12 bit                 No

Numerical Stability

Some orderings of floating-point operations may cause some applications to fail. Linear system solvers may require a different ordering of floating-point operations for different input values.

An algorithm that can always find an appropriate operation order, and thus a solution to the problem, is a numerically stable algorithm.
– An algorithm that falls short is numerically unstable.

Gaussian Elimination Example

Original system:
3X + 5Y + 2Z = 19
2X + 3Y +  Z = 11
 X + 2Y + 2Z = 11

Step 1: divide equation 1 by 3 and equation 2 by 2:
X + 5/3 Y + 2/3 Z = 19/3
X + 3/2 Y + 1/2 Z = 11/2
X +   2 Y +   2 Z = 11

Step 2: subtract equation 1 from equations 2 and 3:
X + 5/3 Y + 2/3 Z = 19/3
  - 1/6 Y - 1/6 Z = -5/6
    1/3 Y + 4/3 Z = 14/3

Gaussian Elimination Example (Cont.)

Step 3: divide equation 2 by -1/6 and equation 3 by 1/3:
X + 5/3 Y + 2/3 Z = 19/3
        Y +     Z = 5
        Y +   4 Z = 14

Step 4: subtract equation 2 from equation 3:
X + 5/3 Y + 2/3 Z = 19/3
        Y +     Z = 5
              3 Z = 9

Gaussian Elimination Example (Cont.)

Step 5: divide equation 3 by 3. We have a solution for Z!
X + 5/3 Y + 2/3 Z = 19/3
        Y +     Z = 5
                Z = 3

Step 6: substitute the Z solution into equation 2. Solution for Y!
X + 5/3 Y + 2/3 Z = 19/3
        Y         = 2
                Z = 3

Step 7: substitute Y and Z into equation 1. Solution for X!
X = 1
Y = 2
Z = 3

The same elimination in augmented-matrix form:

Original:
[3    5    2 | 19]
[2    3    1 | 11]
[1    2    2 | 11]

Step 1: divide row 1 by 3 and row 2 by 2:
[1  5/3  2/3 | 19/3]
[1  3/2  1/2 | 11/2]
[1    2    2 | 11]

Step 2: subtract row 1 from rows 2 and 3:
[1   5/3   2/3 | 19/3]
[0  -1/6  -1/6 | -5/6]
[0   1/3   4/3 | 14/3]

Step 3: divide row 2 by -1/6 and row 3 by 1/3:
[1  5/3  2/3 | 19/3]
[0    1    1 | 5]
[0    1    4 | 14]

Step 4: subtract row 2 from row 3:
[1  5/3  2/3 | 19/3]
[0    1    1 | 5]
[0    0    3 | 9]

Step 5: divide row 3 by 3. Solution for Z!
Step 6: substitute the Z solution into row 2. Solution for Y!
Step 7: substitute Y and Z into row 1. Solution for X!
[1  0  0 | 1]
[0  1  0 | 2]
[0  0  1 | 3]

Basic Gaussian Elimination Is Easy to Parallelize

Have each thread perform all calculations for one row.
– All divisions in a division step can be done in parallel.
– All subtractions in a subtraction step can be done in parallel.
– Barrier synchronization is needed after each step.

A sketch of this scheme appears below. However, there is a problem with numerical stability.
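A minimal CUDA sketch of the row-per-thread scheme, assuming one block with one thread per row, an n x (n+1) row-major augmented matrix, and no pivoting (our own illustration, not the book's code; back-substitution is omitted):

    // Launch as gaussian_no_pivot<<<1, n>>>(A, n); a zero pivot breaks this,
    // which is exactly the numerical-stability problem the next slides address.
    __global__ void gaussian_no_pivot(float *A, int n) {
        int row = threadIdx.x;                     // one thread owns one row
        for (int k = 0; k < n; ++k) {
            // division step: rows at or below k normalize their pivot column
            float p = A[row * (n + 1) + k];
            if (row >= k && p != 0.0f) {
                for (int j = k; j <= n; ++j)
                    A[row * (n + 1) + j] /= p;     // all divisions in parallel
            }
            __syncthreads();                       // barrier after the step
            // subtraction step: rows below k with a 1 in column k subtract row k
            if (row > k && A[row * (n + 1) + k] != 0.0f) {
                for (int j = k; j <= n; ++j)
                    A[row * (n + 1) + j] -= A[k * (n + 1) + j];
            }
            __syncthreads();                       // barrier after the step
        }
    }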

Pivoting

Original (note the zero in the pivot position of row 1):
[0  5  2 | 16]
[2  3  1 | 11]
[1  2  2 | 11]

Pivoting: swap row 1 (equation 1) with row 2 (equation 2):
[2  3  1 | 11]
[0  5  2 | 16]
[1  2  2 | 11]

Step 1: divide row 1 by 2; no need to divide row 2 or row 3:
[1  3/2  1/2 | 11/2]
[0    5    2 | 16]
[1    2    2 | 11]

Pivoting (Cont.)

Step 2: subtract row 1 from row 3 (column 1 of row 2 is already 0):
[1  3/2  1/2 | 11/2]
[0    5    2 | 16]
[0  1/2  3/2 | 11/2]

Step 3: divide row 2 by 5 and row 3 by 1/2:
[1  3/2  1/2 | 11/2]
[0    1  2/5 | 16/5]
[0    1    3 | 11]

Pivoting (Cont.)

Step 4: subtract row 2 from row 3:
[1  3/2   1/2 | 11/2]
[0    1   2/5 | 16/5]
[0    0  13/5 | 39/5]

Step 5: divide row 3 by 13/5. Solution for Z!
[1  3/2  1/2 | 11/2]
[0    1  2/5 | 16/5]
[0    0    1 | 3]

Pivoting (Cont.)

Step 6: substitute the Z solution into equation 2. Solution for Y!
Step 7: substitute Y and Z into equation 1. Solution for X!

Final solution: X = 1, Y = 2, Z = 3.

Figure 7.11: the complete sequence of Gaussian elimination with pivoting (the swap and steps 1 through 7 shown on the preceding slides).

Why Is Pivoting Hard to Parallelize?

We need to scan through all rows (in fact, all columns in the general case) to find the best pivoting candidate.
– This is a major disruption to the parallel computation steps.
– Most parallel algorithms therefore avoid full pivoting.
– Thus most parallel algorithms have some level of numerical instability.

A serial sketch of the pivot search follows.
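The sketch below (our own illustration) shows the partial-pivot search in serial form; finding the argmax over a column is a reduction, which is exactly what interrupts the otherwise row-parallel division and subtraction steps:

    #include <math.h>

    /* Find the row i >= k with the largest |A[i][k]| in an n x (n+1)
       row-major augmented matrix; the caller then swaps row k with it. */
    int find_pivot_row(const float *A, int n, int k) {
        int best = k;
        for (int i = k + 1; i < n; ++i)
            if (fabsf(A[i * (n + 1) + k]) > fabsf(A[best * (n + 1) + k]))
                best = i;
        return best;
    }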

Any more questions? Read Chapter 7.

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign