MIPS Arithmetic Logic Unit (ALU)
[Figure: ALU block with inputs A and B, an operation select m, and outputs result, zero, ovf, and CarryOut.]
Must support the arithmetic/logic operations of the ISA:
  add, addi, addiu, addu
  sub, subu, neg
  mult, multu, div, divu
  sqrt
  and, andi, nor, or, ori, xor, xori
  beq, bne, slt, slti, sltiu, sltu
With special handling for:
  sign extension of the immediate: addi, addiu, slti, sltiu
  zero extension: andi, ori, xori (immediate), lbu (loaded byte)
  no overflow detection: addu, addiu, subu, multu, divu, sltiu, sltu

MIPS Arithmetic Logic Unit (cont.)
[Figure C.5.13: The values of the three ALU control lines, Bnegate and Operation, and the corresponding ALU operations.]
[Fig. C.5.14: 32-bit ALU. Instructions annotated on the figure: add, addi, addiu, addu; sub, subu, beq, bne.]

Review: ALU Construction
1. AND gate (c = a · b)
2. OR gate (c = a + b)
3. Inverter (c = NOT a)
4. Multiplexor (if d == 0, c = a; else c = b)

Addition & Subtraction (cont.)
Just like in grade school (carry/borrow 1s).
Two's complement operations are easy: subtraction is addition of the negated number.
Overflow (result too large for a finite computer word): e.g., adding two n-bit numbers does not always yield an n-bit number.

Review: A Full Adder
[Figure: 1-bit full adder with inputs A, B, and carry_in and outputs S and carry_out.]
S = A xor B xor carry_in (odd parity function)
carry_out = A&B | A&carry_in | B&carry_in
How can we use it to build a 32-bit adder?
How can we modify it easily to build an adder/subtractor?

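The two full-adder equations can be checked against ordinary integer addition. The Python sketch below is only an illustration; the function name `full_adder` is not from the slides.

```python
def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Return (sum_bit, carry_out) for three one-bit inputs."""
    s = a ^ b ^ carry_in                                   # odd parity of the inputs
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)  # majority of the inputs
    return s, carry_out


# Check the gate equations against integer addition for all 8 input rows.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            s, cout = full_adder(a, b, c)
            assert 2 * cout + s == a + b + c
```

Each row satisfies 2 × carry_out + S = A + B + carry_in, which is why 32 of these cells chained by their carries form a 32-bit adder.
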
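A 32-bit Ripple Carry Adder/Subtractor
[Figure: 32 one-bit full adders chained from c0 = carry_in through c1, c2, c3, ..., c31 to c32 = carry_out, taking A0..A31 and B0..B31 and producing S0..S31; an add/sub control line feeds c0 and the B inputs. The slide also carries a worked 4-bit example of A - B.]
Remember, 2's complement is just:
  complement all the bits
  add a 1 in the least significant bit
control (0 = add, 1 = sub): each adder sees B0 if control = 0, !B0 if control = 1.
With control = 1 the adder computes A + !B + 1 = A - B.
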
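A minimal Python sketch of the ripple-carry adder/subtractor follows, assuming the control line is wired both to c0 and to an XOR on every B input as the slide describes. The names `full_adder` and `add_sub` and the `width` parameter are illustrative, not from the slides.

```python
def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """One-bit full adder: returns (sum_bit, carry_out)."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)
    return s, carry_out


def add_sub(a: int, b: int, control: int, width: int = 32) -> tuple[int, int]:
    """Ripple-carry A + B (control = 0) or A - B (control = 1).

    Returns (result, carry_out); result is an unsigned width-bit pattern.
    """
    carry = control                        # c0 = control supplies the "+1"
    result = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = ((b >> i) & 1) ^ control     # B_i if control = 0, !B_i if control = 1
        s_i, carry = full_adder(a_i, b_i, carry)
        result |= s_i << i
    return result, carry
```

For example, `add_sub(0b0111, 0b0110, 1, width=4)` returns `(0b0001, 1)`: 0111 + 1001 + 1 = 1 0001, that is, 7 - 6 = 1 with a carry out of the top bit.
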
Overflow Detection
Overflow: the result is too large to represent in 32 bits.
No overflow when adding a positive and a negative number.
No overflow when the signs are the same for subtraction.
Overflow occurs when:
  adding two positives yields a negative
  or, adding two negatives gives a positive
  or, subtracting a negative from a positive gives a negative
  or, subtracting a positive from a negative gives a positive
On your own: prove you can detect overflow by (carry into MSB) xor (carry out of MSB). Examples for 4-bit signed numbers: 7 + 3 and (-4) + (-5).

Overflow Detection (cont.)
Recall from earlier slides that the largest positive number representable in 4 bits is 7 and the most negative is -8. So any time an addition produces a result greater than 7 or less than -8, you have an overflow.
Keep in mind that adding two numbers with different signs, that is, a negative number and a positive number, can NOT overflow.
Overflow occurs when you add two positive numbers and the sum has a negative sign, or when you add two negative numbers and the sum has a positive sign.
If you spend some time, you can convince yourself that whenever the carry into the most significant bit differs from the carry out of the MSB, you have an overflow.
Worked 4-bit examples:
    0111  (7)         1100  (-4)
  + 0011  (3)       + 1011  (-5)
  ------            ------
    1010  (-6)        0111  (7)
In the first sum the carry into the MSB is 1 and the carry out is 0; in the second the carry in is 0 and the carry out is 1. Both overflow.

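The "carry into MSB xor carry out of MSB" rule can be checked exhaustively for 4-bit operands. The sketch below is an illustration, not a proof; the names `add_with_overflow` and `to_signed` are assumptions of this example.

```python
def add_with_overflow(a: int, b: int, width: int = 4) -> tuple[int, int]:
    """Add two width-bit two's-complement values; return (sum_pattern, overflow)."""
    mask = (1 << width) - 1
    low_mask = mask >> 1                       # every bit below the MSB
    a &= mask
    b &= mask
    carry_into_msb = ((a & low_mask) + (b & low_mask)) >> (width - 1)
    total = a + b
    carry_out_of_msb = total >> width
    return total & mask, carry_into_msb ^ carry_out_of_msb


def to_signed(pattern: int, width: int = 4) -> int:
    """Interpret a width-bit pattern as a two's-complement integer."""
    return pattern - (1 << width) if pattern >> (width - 1) else pattern


# Exhaustive check: the XOR rule flags exactly the sums outside -8..7.
for a in range(-8, 8):
    for b in range(-8, 8):
        _, overflow = add_with_overflow(a, b)
        assert bool(overflow) == (not -8 <= a + b <= 7)
```

With these definitions, `add_with_overflow(7, 3)` returns `(0b1010, 1)` and `add_with_overflow(-4, -5)` returns `(0b0111, 1)`, matching the two 4-bit examples on the slide.
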
Effects of Overflow (p. 227; Chinese edition p. 229)
An exception (interrupt) occurs:
  control jumps to a predefined address for the exception
  the interrupted address is saved for possible resumption (EPC)
Overflow detection is not always wanted, hence the additional MIPS instructions addu, addiu, subu.

Arithmetic for Multimedia
Graphics and media processing operates on vectors of 8-bit and 16-bit data.
  Use a 64-bit adder with a partitioned carry chain
  Operate on 8×8-bit, 4×16-bit, or 2×32-bit vectors
  SIMD (single-instruction, multiple-data)
Saturating operations:
  On overflow, the result is the largest representable value
  cf. 2's-complement modulo arithmetic
  E.g., clipping in audio, saturation in video

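The difference between saturating and modulo (wrap-around) arithmetic can be seen on a single signed 8-bit lane. This is a sketch under assumptions: the function names are illustrative, and clamping negative overflow to the most negative value is the usual convention rather than something the slide states.

```python
def add_wrap(a: int, b: int, width: int = 8) -> int:
    """Signed two's-complement addition that wraps around (modulo 2**width)."""
    mask = (1 << width) - 1
    s = (a + b) & mask
    return s - (1 << width) if s >> (width - 1) else s


def add_saturate(a: int, b: int, width: int = 8) -> int:
    """Signed addition that clamps to the representable range on overflow."""
    lo = -(1 << (width - 1))
    hi = (1 << (width - 1)) - 1
    return max(lo, min(hi, a + b))
```

For two loud 8-bit samples, `add_wrap(100, 100)` gives -56, an audible sign flip, while `add_saturate(100, 100)` gives 127, which is the clipping behaviour the slide refers to.
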
Clipping in Audio
Clipping is a form of waveform distortion that occurs when an amplifier is overdriven and attempts to deliver an output voltage or current beyond its maximum capability.

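3.3 Multiplication (p. 230; Chinese edition p. 233)
Unsigned multiply example:
       1000    (multiplicand)
     x 1011    (multiplier)
     ------
       1000
      1000_
     0000__
    1000___
    -------
    1011000
Example: (0010)2 x (0011)2
       0010    (multiplicand)
     x 0011    (multiplier)
     ------
       0010
      0010_
     0000__
    0000___
    -------
    0000110
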
Multiplication (cont.)
Binary multiplication is just a bunch of left shifts and adds.
[Figure: an n-bit multiplicand and an n-bit multiplier produce a partial product array, which is summed into a 2n-bit double precision product.]
The partial products can be formed in parallel and added in parallel for faster multiplication.

Multiplication (cont.)
More complicated than addition:
  accomplished via shifting and addition
More time and more area.
Let's look at 3 versions based on a grade-school algorithm: (multiplicand) x (multiplier).
Negative numbers: convert and multiply.
  There are better techniques; we won't look at them.

Multiplication: Implementation (first version)
Datapath: Fig. 3.4. Control: Fig. 3.5.
[Figure: step-by-step trace of 1000 x 1011 through the first-version hardware, ending in "Done!".]

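The first-version hardware can be sketched in Python as a loop over the multiplier bits. This follows the usual textbook description of the shift-and-add design (multiplicand shifted left, multiplier shifted right, 2n-bit product); the function name and the `n` parameter are illustrative, and the figures themselves are not reproduced here.

```python
def multiply_v1(multiplicand: int, multiplier: int, n: int = 32) -> int:
    """Unsigned n-bit by n-bit multiply; returns the 2n-bit product."""
    mask = (1 << n) - 1
    mcand = multiplicand & mask     # sits in the right half of a 2n-bit register
    mplier = multiplier & mask
    product = 0                     # 2n-bit product register
    for _ in range(n):
        if mplier & 1:              # test multiplier bit 0
            product += mcand        # 2n-bit add of the shifted multiplicand
        mcand <<= 1                 # shift multiplicand left one bit
        mplier >>= 1                # shift multiplier right one bit
    return product
```

With 4-bit operands, `multiply_v1(0b1000, 0b1011, n=4)` returns `0b1011000` (8 × 11 = 88), the same product as the pencil-and-paper example.
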
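Multiplication: Refined Version (Fig. 3.6; p. 233, Chinese edition p. 236)
The multiplier starts in the right half of the product register.
Add the multiplicand to the left half of the product, then place the result in the left half of the product register.
Shift the product register right after each step.
[Figure: trace of the refined hardware with multiplicand 1000 and a 32-bit ALU, ending in "Done!".]
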
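A Python sketch of the refined scheme is given below, assuming the multiplier is preloaded into the right half of the product register and the ALU only ever adds into the left half. The function name is illustrative; a Python integer stands in for the product register plus its carry bit.

```python
def multiply_refined(multiplicand: int, multiplier: int, n: int = 32) -> int:
    """Unsigned n-bit by n-bit multiply using a single 2n-bit product register."""
    mask = (1 << n) - 1
    mcand = multiplicand & mask
    product = multiplier & mask          # multiplier starts in the right half
    for _ in range(n):
        if product & 1:                  # low bit is the current multiplier bit
            product += mcand << n        # n-bit add into the left half (carry kept)
        product >>= 1                    # shift right; the carry moves into the left half
    return product
```

Tracing `multiply_refined(0b1000, 0b1011, n=4)` gives the register values 0100 0101, 0110 0010, 0011 0001, 0101 1000 after the four shifts, so the result is again 88.
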
MIPS Multiply Instruction
Multiply produces a double precision product:
  mult $t1, $t2    # hi||lo = $t1 * $t2
mflo: move from register Lo

MIPS Multiply Instruction (cont.)
Multiply produces a double precision product:
  mult $s0, $s1    # hi||lo = $s0 * $s1
[Figure: R-format instruction fields: op, rs, rt, rd, shamt, funct.]
The low-order word of the product is left in processor register lo and the high-order word in register hi.
Instructions mfhi rd and mflo rd move the product to (user-accessible) registers in the register file.
Multiplies are done by fast, dedicated hardware that is much more complex (and slower) than an adder. Hardware dividers are even more complex and even slower.
multu does an unsigned multiply.
Both multiplies ignore overflow, so it's up to the software to check whether the product is too big to fit in 32 bits. There is no overflow if hi is 0 for multu, or if hi is the replicated sign of lo for mult.

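The software overflow check described above can be modelled in Python. The functions `mult`, `multu`, and the two `*_overflows` helpers are illustrative models of the hi/lo behaviour, not MIPS code; `mult` is assumed to receive signed Python integers in the 32-bit range.

```python
M32 = 0xFFFFFFFF


def mult(rs: int, rt: int) -> tuple[int, int]:
    """Signed 32x32 multiply; returns (hi, lo) as unsigned 32-bit patterns."""
    p64 = (rs * rt) & 0xFFFFFFFFFFFFFFFF   # 64-bit two's-complement product
    return p64 >> 32, p64 & M32


def multu(rs: int, rt: int) -> tuple[int, int]:
    """Unsigned 32x32 multiply; returns (hi, lo)."""
    p64 = (rs & M32) * (rt & M32)
    return p64 >> 32, p64 & M32


def multu_overflows(hi: int, lo: int) -> bool:
    """Unsigned product fits in 32 bits only if hi is 0."""
    return hi != 0


def mult_overflows(hi: int, lo: int) -> bool:
    """Signed product fits in 32 bits only if hi replicates the sign bit of lo."""
    sign_fill = M32 if lo >> 31 else 0
    return hi != sign_fill
```

For instance, `mult(-2, 3)` yields hi = 0xFFFFFFFF and lo = 0xFFFFFFFA, which passes the signed check, while `mult(0x8000, 0x10000)` yields hi = 0 and lo = 0x80000000, which fails it because 2^31 does not fit in a signed 32-bit word.
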
3.4 Division
Division is just a bunch of quotient digit guesses and right shifts and subtracts.
Dividend = Quotient × Divisor + Remainder
[Figure: long-division layout showing the divisor, the dividend, the quotient, the partial remainder array, and the remainder.]

Division: First Version (Fig. 3.9; p. 237, Chinese edition p. 241)
[Figure: trace of the first-version divide hardware, alternating Subtract steps with restoring Add steps. See also p. 186 (Chinese edition p. 183).]
33 repetitions = (#dividend bits) - (#divisor bits) + 1

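The subtract-and-restore loop can be sketched in Python as follows. This follows the usual textbook first-version divider (divisor starts in the left half of a 2n-bit register and shifts right each step, n + 1 repetitions); the function name and the `n` parameter are assumptions of this example.

```python
def divide_v1(dividend: int, divisor: int, n: int = 32) -> tuple[int, int]:
    """Unsigned restoring division; returns (quotient, remainder)."""
    mask = (1 << n) - 1
    if divisor & mask == 0:
        raise ZeroDivisionError("software must check the divisor")
    remainder = dividend & mask          # 2n-bit remainder register holds the dividend
    div = (divisor & mask) << n          # divisor starts in the left half
    quotient = 0
    for _ in range(n + 1):               # 33 repetitions when n = 32
        remainder -= div                 # Subtract
        if remainder >= 0:
            quotient = (quotient << 1) | 1
        else:
            remainder += div             # Add: restore the old remainder
            quotient = quotient << 1
        div >>= 1                        # shift divisor right one bit
    return quotient, remainder
```

With 4-bit operands, `divide_v1(7, 2, n=4)` runs five steps (three restores followed by two successful subtracts) and returns `(3, 1)`.
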
MIPS Divide Instruction
Divide generates the remainder in hi and the quotient in lo:
  div $s0, $s1    # lo = $s0 / $s1
                  # hi = $s0 mod $s1
[Figure: R-format instruction fields: op, rs, rt, rd, shamt, funct.]
Instructions mfhi rd and mflo rd move the remainder and quotient to (user-accessible) registers in the register file.
Seems odd to me that the machine doesn't support a double precision dividend in hi || lo, but it looks like it doesn't.
As with multiply, divide ignores overflow, so software must determine whether the quotient is too large. Software must also check the divisor to avoid division by 0.

MIPS Multiply/Divide Summary
[Table: only one entry survives, "move to register Lo".]

3.5 Floating Point (a brief look)
We need a way to represent:
  numbers with fractions
  very small numbers
  very large numbers, e.g., 10^9
Representation:
  sign, exponent, significand: (-1)^sign × significand × 2^exponent
  more bits for the significand gives more accuracy
  more bits for the exponent increases range
IEEE 754 floating point standard:
  single precision: 8-bit exponent, 23-bit significand
  double precision: 11-bit exponent, 52-bit significand

Representing Big (and Small) Numbers (p. 245; Chinese edition p. 249)
What if we want to encode the approximate age of the earth?
  4,600,000,000, or 4.6 × 10^9
or the weight in kg of one a.m.u. (atomic mass unit)?
  a value on the order of 10^-27
There is no way we can encode either of the above in a 32-bit integer.
Floating point representation: (-1)^sign × F × 2^E
  Still have to fit everything in 32 bits (single precision):
    s        E (exponent)    F (fraction)
    1 bit    8 bits          23 bits
The base (2, not 10) is hardwired in the design of the FPALU.
More bits in the fraction (F) or the exponent (E) is a trade-off between precision (accuracy of the number) and range (size of the number).
Notice that in scientific notation the mantissa is represented in normalized form (no leading zeros).

IEEE 754 FP Standard
Sign and magnitude representation: (-1)^S × (1 + Significand) × 2^E
Single precision (32 bits):
  S        Exponent    Significand
  1 bit    8 bits      23 bits
Double precision (64 bits):
  S        Exponent    Significand
  1 bit    11 bits     52 bits
Overflow/underflow occurs when the exponent is too large/too small to be represented in the exponent field.
Hidden bit, e.g., 0.110 × 2^-1 = 1.100 × 2^-2

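IEEE 754 FP Standard (cont.)
The IEEE 754 bias is 127 for single precision and 1023 for double precision.
(-1)^S × (1 + Significand) × 2^E
Exponent = E + bias
E = Exponent - bias
[Figure: single-precision mapping between the true exponent E and the stored Exponent field with bias 127: E = 128 is stored as 255, E = 127 as 254, ..., E = -1 as 126, E = -127 as 0.]
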
IEEE 754 FP Standard (cont.) (Fig. 3.14; p. 246, Chinese edition p. 251)
Most computers these days conform to the IEEE 754 floating point standard: (-1)^sign × (1 + F) × 2^(E - bias)
  Formats for both single and double precision
  F is stored in normalized form, where the msb is 1 (so there is no need to store it!); this is called the hidden bit
  To simplify sorting FP numbers, E comes before F in the word and E is represented in excess (biased) notation

  Single Precision        Double Precision        Object Represented
  E (8)     F (23)        E (11)     F (52)
  0         0             0          0            true zero (0)
  0         nonzero       0          nonzero      ± denormalized number
  1-254     anything      1-2046     anything     ± floating point number
  255       0             2047       0            ± infinity
  255       nonzero       2047       nonzero      not a number (NaN)

The book distinguishes between the representation with the hidden bit present, the significand (the 24-bit version), and the one with the hidden bit removed, the fraction (the 23-bit version).

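The single-precision encoding can be inspected from Python with the standard `struct` module. The helper names below (`decode_single`, `classify`, `value_of_normalized`) are illustrative and cover only the single-precision column of the table.

```python
import struct


def decode_single(x: float) -> tuple[int, int, int]:
    """Return the (sign, E, F) fields of x rounded to IEEE 754 single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def classify(exponent: int, fraction: int) -> str:
    """Name the object represented by an (E, F) pair in single precision."""
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == 255:
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"


def value_of_normalized(sign: int, exponent: int, fraction: int) -> float:
    """Evaluate (-1)^sign x (1 + F) x 2^(E - bias) with bias = 127."""
    return (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)
```

For example, `decode_single(-0.75)` returns `(1, 126, 0x400000)`: -0.75 is -1.1 (binary) × 2^-1, so the stored exponent is -1 + 127 = 126 and the fraction holds the bits after the hidden 1.
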
Floating Point Complexities
Operations are somewhat more complicated (see text).
In addition to overflow we can have “underflow”.
Accuracy can be a big problem:
  IEEE 754 keeps two extra bits, guard and round
  four rounding modes
  positive divided by zero yields “infinity”
  zero divided by zero yields “not a number”
  other complexities
Implementing the standard can be tricky.

Floating Point Addition
Addition (and subtraction):
  (F1 × 2^E1) + (F2 × 2^E2) = F3 × 2^E3
Step 0: Restore the hidden bit in F1 and in F2.
Step 1: Align fractions by right shifting F2 by E1 - E2 positions (assuming E1 ≥ E2), keeping track of (three of) the bits shifted out in a guard bit, a round bit, and a sticky bit.
Step 2: Add the resulting F2 to F1 to form F3.
Step 3: Normalize F3 (so it is in the form 1.XXXXX …):
  If F1 and F2 have the same sign, F3 ∈ [1, 4), so at most a 1-bit right shift of F3 and an increment of E3 is needed.
  If F1 and F2 have different signs, F3 may require many left shifts, each time decrementing E3.
Step 4: Round F3 and possibly normalize F3 again.
Note that the smaller significand is the one shifted (while increasing its exponent until it equals the larger exponent).
Overflow/underflow can occur during addition and during normalization.
Rounding may lead to overflow/underflow as well.

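Floating Point Addition (Fig. 3.15; p. 252, Chinese edition p. 257)
[Figure 3.15: flowchart of the floating point addition algorithm, including the "Still normalized?" check after rounding.]
Step 5: Rehide the most significant bit of F3 before storing the result.
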
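The addition steps can be sketched in Python on the raw (E, F) fields of single-precision numbers. This is a deliberately restricted illustration: it assumes both operands are positive and normalized, uses round-to-nearest-even only, and ignores exponent overflow. The names `fields_of`, `value_of`, and `fp_add` are not from the slides.

```python
import struct


def fields_of(x: float) -> tuple[int, int]:
    """Return the (E, F) fields of positive x rounded to single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return (bits >> 23) & 0xFF, bits & 0x7FFFFF


def value_of(e: int, f: int) -> float:
    """Return the positive single-precision value with fields (E, F)."""
    (x,) = struct.unpack(">f", struct.pack(">I", (e << 23) | f))
    return x


def fp_add(e1: int, f1: int, e2: int, f2: int) -> tuple[int, int]:
    """Add two positive normalized singles given as (E, F); returns (E3, F3)."""
    # Restore the hidden bit, then append guard, round, and sticky bits.
    s1 = ((1 << 23) | f1) << 3
    s2 = ((1 << 23) | f2) << 3
    if e1 < e2:                              # make operand 1 the larger exponent
        e1, s1, e2, s2 = e2, s2, e1, s1
    # Align: right shift the smaller operand, OR-ing lost bits into sticky.
    shift = e1 - e2
    lost = s2 & ((1 << shift) - 1)
    s2 = (s2 >> shift) | (1 if lost else 0)
    # Add the aligned significands.
    s3, e3 = s1 + s2, e1
    # Normalize: same-sign sum lies in [1, 4), so at most one right shift.
    if s3 >> 27:
        s3 = (s3 >> 1) | (s3 & 1)            # keep the sticky information
        e3 += 1
    # Round to nearest even using the three extra bits.
    grs = s3 & 0b111
    s3 >>= 3
    if grs > 0b100 or (grs == 0b100 and s3 & 1):
        s3 += 1
        if s3 >> 24:                         # rounding carried out: normalize again
            s3 >>= 1
            e3 += 1
    # Rehide the most significant bit before storing.
    return e3, s3 & 0x7FFFFF
```

As a usage example, `fp_add(*fields_of(1.5), *fields_of(0.25))` returns the same fields as `fields_of(1.75)`, and adding 3.0 to 16777216.0 (2^24) lands on the tie case, which rounds to 16777220.0.
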
MIPS Floating Point Instructions
MIPS has a separate floating point register file ($f0, $f1, …, $f31), whose registers are used in pairs for double precision values, with special instructions to load to and store from them (from/to coprocessor 1):
  lwc1 $f0,54($s2)    # $f0 = Memory[$s2+54]
  swc1 $f0,58($s4)    # Memory[$s4+58] = $f0
It supports IEEE 754 single precision operations:
  add.s $f2,$f4,$f6   # $f2 = $f4 + $f6
and double precision operations:
  add.d $f2,$f4,$f6   # $f2||$f3 = $f4||$f5 + $f6||$f7
and similarly sub.s, sub.d, mul.s, mul.d, div.s, div.d.

MIPS Floating Point Instructions (cont.)
Floating point single precision comparison operations:
  c.lt.s $f2,$f4      # if($f2 < $f4) cond=1; else cond=0
where lt may be replaced with eq, neq, le, gt, ge,
and branch operations:
  bc1t                # if(cond==1) go to PC+4+100
  bc1f                # if(cond==0) go to PC+4+100
And double precision comparison operations:
  c.lt.d $f2,$f4      # if($f2||$f3 < $f4||$f5) cond=1; else cond=0

3.6 Are FP add and subtract associative?
Parallel programs may interleave operations in unexpected orders.
  Assumptions of associativity may fail
Need to validate parallel programs under varying degrees of parallelism.

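A three-operand sum is enough to show associativity failing. The values below are chosen for illustration (one large magnitude with both signs and one small term) and are evaluated in Python's double precision.

```python
x, y, z = -1.5e38, 1.5e38, 1.0

left = (x + y) + z     # the large terms cancel first, so z survives
right = x + (y + z)    # z is absorbed by y before the cancellation

assert left == 1.0
assert right == 0.0
assert left != right   # same operands, different grouping, different sum
```

A parallel reduction that groups the operands differently from the sequential loop can therefore return a different sum, which is why results need validating at several degrees of parallelism.
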
x86 FP Architecture
Originally based on the 8087 FP coprocessor:
  8 × 80-bit extended-precision registers
  Used as a push-down stack
  Registers indexed from TOS: ST(0), ST(1), …
FP values are 32-bit or 64-bit in memory:
  Converted on load/store of memory operand
  Integer operands can also be converted on load/store
Very difficult to generate and optimize code.
  Result: poor FP performance

Streaming SIMD Extension 2 (SSE2)
Adds 8 × 128-bit registers.
  Extended to 16 registers in AMD64/EM64T
Can be used for multiple FP operands:
  2 × 64-bit double precision
  4 × 32-bit single precision
  Instructions operate on them simultaneously
  Single-Instruction Multiple-Data

Streaming SIMD Extensions
In computing, Streaming SIMD Extensions (SSE) is a SIMD instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in its Pentium III series processors as a reply to AMD's 3DNow! (which had debuted a year earlier). SSE contains 70 new instructions.
SSE originally added eight new 128-bit registers known as XMM0 through XMM7. The AMD64 extensions from AMD (originally called x86-64 and later duplicated by Intel) add a further eight registers, XMM8 through XMM15.
SSE2, introduced with the Pentium 4, is a major enhancement to SSE. SSE2 adds new math instructions for double-precision (64-bit) floating point and also extends MMX instructions to operate on 128-bit XMM registers.
SSE3 is an incremental upgrade to SSE2, adding a handful of DSP-oriented mathematics instructions and some process (thread) management instructions.
SSE4 is another major enhancement, adding a dot product instruction, additional integer instructions, a popcnt instruction, and more.

Right Shift and Division (§3.8 Fallacies and Pitfalls)
Left shift by i places multiplies an integer by 2^i.
Right shift divides by 2^i?
  Only for unsigned integers
For signed integers:
  Arithmetic right shift: replicate the sign bit
  e.g., -5 / 4: 11111011 >> 2 = 11111110 = -2
    Rounds toward -∞
  cf. 11111011 >>> 2 = 00111110 = +62

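The -5 / 4 example can be reproduced on 8-bit patterns in Python. The helper names are illustrative, and the shift amount is assumed to lie between 0 and the register width.

```python
def logical_shift_right(pattern: int, i: int, width: int = 8) -> int:
    """Shift a width-bit pattern right by i places, filling with zeros."""
    return (pattern & ((1 << width) - 1)) >> i


def arithmetic_shift_right(pattern: int, i: int, width: int = 8) -> int:
    """Shift a width-bit pattern right by i places, replicating the sign bit."""
    pattern &= (1 << width) - 1
    shifted = pattern >> i
    if pattern >> (width - 1):                      # sign bit is set
        shifted |= ((1 << i) - 1) << (width - i)    # fill vacated bits with 1s
    return shifted


def to_signed(pattern: int, width: int = 8) -> int:
    """Interpret a width-bit pattern as a two's-complement integer."""
    return pattern - (1 << width) if pattern >> (width - 1) else pattern


minus_five = 0b11111011
assert to_signed(arithmetic_shift_right(minus_five, 2)) == -2   # rounds toward -infinity
assert logical_shift_right(minus_five, 2) == 62                 # sign bit lost
assert int(-5 / 4) == -1                                        # truncating division differs
```

The arithmetic shift gives -2 where truncating division gives -1, so a compiler cannot blindly replace a signed divide by a power of two with a shift.
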
Who Cares About FP Accuracy?
Important for scientific code.
  But for everyday consumer use?
    “My bank balance is out by …¢!”
The Intel Pentium FDIV bug:
  The market expects accuracy
  See Colwell, The Pentium Chronicles

Concluding Remarks (§3.9)
ISAs support arithmetic:
  Signed and unsigned integers
  Floating-point approximation to reals
Bounded range and precision:
  Operations can overflow and underflow
MIPS ISA:
  Core instructions: 54 most frequently used
    100% of SPECINT, 97% of SPECFP
  Other instructions: less frequent