Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

ALU Implementation

Review: Basic Hardware

To warm up let's build a logic unit to support the and and or instructions for MIPS (32-bit registers) We'll just build a 1-bit unit and use 32 of them Possible implementation using a multiplexor : A Simple Multi-Function Logic Unit a b output operation selector

Selects one of the inputs to be the output based on a control input Implementation with a Multiplexor...... 32 units

Not easy to decide the best way to implement do not want too many inputs to a single gate do not want to have to go through too many gates (= levels) Let's look at a 1-bit ALU for addition: Implementations c out = a.b + a.c in + b.c in sum = a.b.c in + a.b.c in + a.b.c in + a.b.c in = a  b  c in exclusive or (xor)

Building a 1-bit Binary Adder 1 bit Full Adder A B S carry_in carry_out S = A xor B xor carry_in carry_out = A  B v A  carry_in v B  carry_in (majority function)  How can we use it to build a 32-bit adder?  How can we modify it easily to build an adder/subtractor? ABcarry_incarry_outS 00000 00101 01001 01110 10001 10110 11010 11111

1-bit Adder Logic Half-adder with one xor gate Full-adder from 2 half-adders and an or gate Half-adder with the xor gate replaced by primitive gates using the equation A  B = A.B +A.B xor

Building a 32-bit ALU Ripple-Carry Logic for 32-bit ALU 1-bit ALU for AND, OR and add Multiplexor control line

Two's complement approach: just negate b and add. How do we negate? recall negation shortcut : invert each bit of b and set CarryIn to least significant bit (ALU0) to 1 What about Subtraction (a – b) ?

Overflow Detection and Effects  Overflow: the result is too large to represent in the number of bits allocated  When adding operands with different signs, overflow cannot occur! Overflow occurs when l adding two positives yields a negative l or, adding two negatives gives a positive l or, subtract a negative from a positive gives a negative l or, subtract a positive from a negative gives a positive  On overflow, an exception (interrupt) occurs l Control jumps to predefined address for exception l Interrupted address (address of instruction causing the overflow) is saved for possible resumption  Don't always want to detect (interrupt on) overflow

Overflow Detection  Overflow can be detected by: l Carry into MSB xor Carry out of MSB  Logic for overflow detection therefore can be put in to ALU31 1 1 11 0 1 0 1 1 0 0111 0011+ 7 3 0 1 – 6 1100 1011+ –4 – 5 7 1 0

New MIPS Instructions CategoryInstrOp CodeExampleMeaning Arithmetic (R & I format) add unsigned0 and 21addu $s1, $s2, $s3$s1 = $s2 + $s3 sub unsigned0 and 23subu $s1, $s2, $s3$s1 = $s2 - $s3 add imm.unsigned 9addiu $s1, $s2, 6$s1 = $s2 + 6 Data Transfer ld byte unsigned 24lbu $s1, 25($s2)$s1 = Mem($s2+25) ld half unsigned25lhu $s1, 25($s2)$s1 = Mem($s2+25) Cond. Branch (I & R format) set on less than unsigned 0 and 2bsltu $s1, $s2, $s3if ($s2<$s3) $s1=1 else $s1=0 set on less than imm unsigned bsltiu $s1, $s2, 6if ($s2<6) $s1=1 else $s1=0  Sign extend – addiu, addiu, slti, sltiu  Zero extend – andi, ori, xori  Overflow detected – add, addi, sub

Tailoring the ALU to MIPS: Test for Less-than and Equality Need to support the set-on-less-than instruction e.g., slt $t0, $t3, $t4 remember: slt is an R-type instruction that produces 1 if rs < rt and 0 otherwise idea is to use subtraction: rs < rt  rs – rt < 0. Recall msb of negative number is 1 two cases after subtraction rs – rt: if no overflow then rs < rt  most significant bit of rs – rt = 1 if overflow then rs < rt  most significant bit of rs – rt = 0 why? e.g., 5 ten – 6 ten = 0101 – 0110 = 0101 + 1010 = 1111 (ok!) -7 ten – 6 ten = 1001 – 0110 = 1001 + 1010 = 0011 (overflow!) set bit is sent from ALU31 to ALU0 as the Less bit at ALU0; all other Less bits are hardwired 0; so Less is the 32-bit result of slt

Supporting slt 1- bit ALU for the 31 least significant bits 1-bit ALU for the most significant bit Extra set bit, to be routed to the Less input of the least significant 1-bit ALU, is computed from the most significant Result bit and the Overflow bit (it is not the output of the adder as the figure seems to indicate) Less input of the 31 most significant ALUs is always 0 32-bit ALU from 31 copies of ALU at top left and 1 copy of ALU at bottom left in the most significant position

Tailoring the ALU to MIPS: Test for Less-than and Equality Need to support test for equality e.g., beq $t5, $t6, $t7 use subtraction: rs - rt = 0  rs = rt do we need to consider overflow?

Supporting Test for Equality ALU control lines Bneg- Oper- Func- ate ation tion 0 00 and 0 01 or 0 10 add 1 10 sub 1 11 slt Symbol representing ALU Output is 1 only if all Result bits are 0 Combine CarryIn to least significant ALU and Binvert to a single control line as both are always either 1 or 0 32-bit MIPS ALU

Conclusion We can build an ALU to support the MIPS instruction set key idea: use multiplexor to select the output we want we can efficiently perform subtraction using two’s complement we can replicate a 1-bit ALU to produce a 32-bit ALU Important points about hardware speed of a gate depends number of inputs (fan-in) to the gate speed of a circuit depends on number of gates in series (particularly, on the critical path to the deepest level of logic) Speed of MIPS operations clever changes to organization can improve performance (similar to using better algorithms in software) we’ll look at examples for addition, multiplication and division

Building a 32-bit ALU Ripple-Carry Logic for 32-bit ALU 1-bit ALU for AND, OR and add Multiplexor control line

Is a 32-bit ALU as fast as a 1-bit ALU? Why? Is there more than one way to do addition? Yes: one extreme: ripple-carry – carry ripples through 32 ALUs, slow! other extreme: sum-of-products for each CarryIn bit – super fast! CarryIn bits: c 1 = b 0.c 0 + a 0.c 0 + a 0.b 0 c 2 = b 1.c 1 + a 1.c 1 + a 1.b 1 = a 1.a 0.b 0 + a 1.a 0.c 0 + a 1.b 0.c 0 (substituting for c 1 ) + b 1.a 0.b 0 + b 1.a 0.c 0 + b 1.b 0.c 0 + a 1.b 1 c 3 = b 2.c 2 + a 2.c 2 + a 2.b 2 = … = sum of 15 4-term products… How fast? But not feasible for a 32-bit ALU! Why? Exponential complexity!! Problem: Ripple-carry Adder is Slow Note: c i is CarryIn bit into i th ALU; c 0 is the forced CarryIn into the least significant ALU

An approach between our two extremes Motivation: if we didn't know the value of a carry-in, what could we do? when would we always generate a carry? (generate) g i = a i. b i when would we propagate the carry? (propagate) p i = a i + b i Express (carry-in equations in terms of generate/propagates) c 1 = g 0 + p 0.c 0 c 2 = g 1 + p 1.c 1 = g 1 + p 1.g 0 + p 1.p 0.c 0 c 3 = g 2 + p 2.c 2 = g 2 + p 2.g 1 + p 2.p 1.g 0 + p 2.p 1.p 0.c 0 c 4 = g 3 + p 3.c 3 = g 3 + p 3.g 2 + p 3.p 2.g 1 + p 3.p 2.p 1.g 0 + p 3.p 2.p 1.p 0.c 0 Feasible for 4-bit adders – with wider adders unacceptable complexity. solution: build a first level using 4-bit adders, then a second level on top Two-level Carry-lookahead Adder: First Level

Two-level Carry-lookahead Adder: Second Level for a 16-bit adder Propagate signals for each of the four 4-bit adder blocks: P 0 = p 3.p 2.p 1.p 0 P 1 = p 7.p 6.p 5.p 4 P 2 = p 11.p 10.p 9.p 8 P 3 = p 15.p 14.p 13.p 12 Generate signals for each of the four 4-bit adder blocks: G 0 = g 3 + p 3.g 2 + p 3.p 2.g 1 + p 3.p 2.p 1.g 0 G 1 = g 7 + p 7.g 6 + p 7.p 6.g 5 + p 7.p 6.p 5.g 4 G 2 = g 11 + p 11.g 10 + p 11.p 10.g 9 + p 11.p 10.p 9.g 8 G 3 = g 15 + p 15.g 14 + p 15.p 14.g 13 + p 15.p 14.p 13.g 12

Two-level Carry-lookahead Adder: Second Level for a 16-bit adder CarryIn signals for each of the four 4-bit adder blocks (see earlier carry-in equations in terms of generate/propagates): C 1 = G 0 + P 0.c 0 C 2 = G 1 + P 1.G 0 + P 1.P 0.c 0 C 3 = G 2 + P 2.G 1 + P 2.P 1.G 0 + P 2.P 1.P 0.c 0 C 4 = G 3 + P 3.G 2 + P 3.P 2.G 1 + P 3.P 2.P 1.G 0 + P 3.P 2.P 1.P 0.c 0

CarryIn Result0--3 CarryIn Result4--7 CarryIn Result8--11 CarryIn CarryOut Result12--15 CarryIn C1 C2 C3 C4 P0 G0 P1 G1 P2 G2 P3 G3 a0 b0 a1 b1 a2 b2 a3 b3 a4 b4 a5 b5 a6 b6 a7 b7 a8 b8 a9 b9 a10 b10 a11 b11 a12 b12 a13 b13 a14 b14 a15 b15 Logic to compute C1, C2, C3, C4 4bAdder0 4bAdder1 4bAdder2 4bAdder3 16-bit carry-lookahead adder from four 4-bit adders and one carry-lookahead unit Carry-lookahead Logic ALU0 ALU1 ALU2 ALU3 a1 a0 a2 a3 b1 b0 b2 b3 s0 s1 s2 s3 Logic to compute c1, c2, c3, c4, P0, G0 Blow-up of 4-bit adder: (conceptually) consisting of four 1-bit ALUs plus logic to compute all CarryOut bits and one super generate and one super propagate bit. Each 1-bit ALU is exactly as for ripple-carry except c1, c2, c3 for ALUs 1, 2, 3 comes from the extra logic CarryIn Carry-lookahead Unit

Two-level carry-lookahead logic steps: 1. compute p i ’s and g i ’s at each 1-bit ALU 2. compute P i ’s and G i ’s at each 4-bit adder unit 3. compute C i ’s in carry-lookahead unit 4. compute c i ’s at each 4-bit adder unit 5. compute results (sum bits) at each 1-bit ALU Two-level Carry-lookahead Adder: Second Level for a 16-bit adder

Multiplication Standard shift-add method: Multiplicand 1000 Multiplier 1001 1000 0000 0000 1000 Product 01001000 m bits x n bits = m+n bit product Binary makes it easy: multiplier bit 1 => copy multiplicand (1 x multiplicand) multiplier bit 0 => place 0 (0 x multiplicand) 3 versions of multiply hardware & algorithms: x

Shift-add Multiplier Version 1 Algorithm Multiplicand 1000 Multiplier 1001 1000 0000 0000 1000 Product 01001000

Shift-add Multiplier Version 1 Multiplicand register, product register, ALU are 64-bit wide; multiplier register is 32-bit wide Algorithm 32-bit multiplicand starts at right half of multiplicand register Product register is initialized at 0

Shift-add Multiplier Version1 Itera Step Multiplier Multiplicand Product -tion 0 init 0011 0000 0010 0000 0000 values 1 1a 0011 0000 0010 0000 0010 2 0011 0000 0100 0000 0010 3 0001 0000 0100 0000 0010 2 … Example: 0010 * 0011: Algorithm

Observations on Multiply Version 1 1 step per clock cycle  nearly 100 clock cycles to multiply two 32-bit numbers Half the bits in the multiplicand register always 0  64-bit adder is wasted 0’s inserted to right as multiplicand is shifted left  least significant bits of product never change once formed Intuition: instead of shifting multiplicand to left, shift product to right…

Shift-add Multiplier Version 2 Algorithm Multiplicand 1000 Multiplier 1001 1000 0000 0000 1000 Product 01001000

Shift-add Multiplier Version 2 Multiplicand register, multiplier register, ALU are 32-bit wide; product register is 64-bit wide; multiplicand adds to left half of product register Algorithm Product register is initialized at 0

Shift-add Multiplier Version 2 Itera Step Multiplier Multiplicand Product -tion 0 init 0011 0010 0000 0000 values 1 1a 0011 0010 0010 0000 2 0011 0010 0001 0000 3 0001 0010 0001 0000 2 … Example: 0010 * 0011: Algorithm

Observations on Multiply Version 2 Each step the product register wastes space that exactly matches the current size of the multiplier Intuition: combine multiplier register and product register…

Shift-add Multiplier Version 3 No separate multiplier register; multiplier placed on right side of 64-bit product register Algorithm Product register is initialized with multiplier on right

Shift-add Multiplier Version 3 Itera Step Multiplicand Product -tion 0 init 0010 0000 0011 values 1 1a 0010 0010 0011 2 0010 0001 0001 2 … Example: 0010 * 0011: Algorithm

Observations on Multiply Version 3 2 steps per bit because multiplier & product combined What about signed multiplication? easiest solution is to make both positive and remember whether to negate product when done, i.e., leave out the sign bit, run for 31 steps, then negate if multiplier and multiplicand have opposite signs Booth’s Algorithm is an elegant way to multiply signed numbers using same hardware – it also often quicker…

MIPS Notes MIPS provides two 32-bit registers Hi and Lo to hold a 64-bit product mult, multu (unsigned) put the product of two 32-bit register operands into Hi and Lo : overflow is ignored by MIPS but can be detected by programmer by examining contents of Hi mflo, mfhi moves content of Hi or Lo to a general-purpose register Pseudo-instructions mul (without overflow), mulo (with overflow), mulou (unsigned with overflow) take three 32-bit register operands, putting the product of two registers into the third

Divide 1001 Quotient Divisor 1000 1001010 Dividend –1000 10 101 1010 –1000 10 Remainder standart method: see how big a multiple of the divisor can be subtracted, creating quotient digit at each step Binary makes it easy  first, try 1 * divisor; if too big, 0 * divisor Dividend = (Quotient * Divisor) + Remainder 3 versions of divide hardware & algorithm:

Divide Version 1 Itera- Step Quotient Divisor Remainder tion 0 init 0000 0010 0000 0000 0111 1 1 0000 0010 0000 1110 0111 2b 0000 0010 0000 0000 0111 3 0000 0001 0000 0000 0111 2 … 3 4 5 Example: 0111 / 0010: Algorithm

Divide Version 1 Divisor register, remainder register, ALU are 64-bit wide; quotient register is 32-bit wide Algorithm 32-bit divisor starts at left half of divisor register Remainder register is initialized with the dividend at right Why 33? Quotient register is initialized to be 0

Divide Version 1

Observations on Divide Version 1 Half the bits in divisor always 0  1/2 of 64-bit adder is wasted  1/2 of divisor register is wasted Intuition: instead of shifting divisor to right, shift remainder to left… Step 1 cannot produce a 1 in quotient bit – as all bits corresponding to the divisor in the remainder register are 0 (remember all operands are 32-bit) Intuition: switch order to shift first and then subtract – can save 1 iteration…

Divide Version 2 Divisor register, quotient register, ALU are 32-bit wide; remainder register is 64-bit wide Remainder register is initialized with the dividend at right Algorithm Why this correction step?

Observations on Divide Version 2 Each step the remainder register wastes space that exactly matches the current size of the quotient Intuition: combine quotient register and remainder register…

Divide Version 3 No separate quotient register; quotient is entered on the right side of the 64-bit remainder register Algorithm Remainder register is initialized with the dividend at right Why this correction step?

Divide Version 3 Example: 0111 / 0010: Itera- Step Divisor Remainder tion 0 init 0010 0000 0111 1 0010 0000 1110 1 2 0010 1110 1110 3b 0010 0001 1100 2 … 3 4 Algorithm

Observations on Divide Version 3 Same hardware as Multiply Version 3 Signed divide: make both divisor and dividend positive and perform division negate the quotient if divisor and dividend were of opposite signs make the sign of the remainder match that of the dividend this ensures always dividend = (quotient * divisor) + remainder –quotient (x/y) = quotient (–x/y) (e.g. 7 = 3*2 + 1 & –7 = –3*2 – 1)

 Divide generates the reminder in hi and the quotient in lo div $s0, $s1 # lo = $s0 / $s1 # hi = $s0 mod $s1 l Instructions mfhi rd and mflo rd are provided to move the quotient and reminder to (user accessible) registers in the register file MIPS Divide Instruction  As with multiply, divide ignores overflow so software must determine if the quotient is too large. Software must also check the divisor to avoid division by 0. op rs rt rd shamt funct

MIPS Notes div rs, rt (signed), divu rs, rt (unsigned), with two 32-bit register operands, divide the contents of the operands and put remainder in Hi register and quotient in Lo ; overflow is ignored in both cases pseudo-instructions div rdest, rsrc1, src2 (signed with overflow), divu rdest, rsrc1, src2 (unsigned without overflow) with three 32-bit register operands puts quotients of two registers into third

Floating Point

Review of Numbers Computers are made to deal with numbers What can we represent in N bits? Unsigned integers: 0to2 N - 1 Signed Integers (Two’s Complement) -2 (N-1) to 2 (N-1) - 1

Other Numbers What about other numbers? Very large numbers? (seconds/century) 3,155,760,000 10 (3.15576 10 x 10 9 ) Very small numbers? (atomic diameter) 0.00000001 10 (1.0 10 x 10 -8 ) Rationals (repeating pattern) 2/3 (0.666666666...) Irrationals 2 1/2 (1.414213562373...) Transcendentals e (2.718...),  (3.141...) All represented in scientific notation

Scientific Notation (in Decimal) 6.02 10 x 10 23 radix (base) decimal pointmantissa exponent Normalized form: no leadings 0s (exactly one digit to left of decimal point) Alternatives to representing 1/1,000,000,000 Normalized: 1.0 x 10 -9 Not normalized: 0.1 x 10 -8,10.0 x 10 -10

Scientific Notation (in Binary) 1.0 two x 2 -1 radix (base) “binary point” exponent Computer arithmetic that supports it called floating point, because it represents numbers where the binary point is not fixed, as it is for integers Declare such variable in C as float mantissa

Floating Point Representation (1/2) Normal format: +1.xxxxxxxxxx two *2 yyyy two Multiple of Word Size (32 bits) 0 31 SExponent 302322 Significand 1 bit8 bits23 bits S represents Sign Exponent represents y’s Significand represents x’s Represent numbers as small as 2.0 x 10 -38 to as large as 2.0 x 10 38

Floating Point Representation (2/2) What if result too large? (> 2.0x10 38 ) Overflow! Overflow  Exponent larger than represented in 8-bit Exponent field What if result too small? (>0, < 2.0x10 -38 ) Underflow! Underflow  Negative exponent larger than represented in 8-bit Exponent field How to reduce chances of overflow or underflow?

Double Precision Fl. Pt. Representation Next Multiple of Word Size (64 bits) Double Precision (vs. Single Precision) C variable declared as double Represent numbers almost as small as 2.0 x 10 -308 to almost as large as 2.0 x 10 308 But primary advantage is greater accuracy due to larger significand 0 31 SExponent 302019 Significand 1 bit11 bits20 bits Significand (cont’d) 32 bits

IEEE 754 Floating Point Standard (1/4) Single Precision, DP similar Sign bit:1 means negative 0 means positive Significand: To pack more bits, leading 1 implicit for normalized numbers 1 + 23 bits single, 1 + 52 bits double always true: Significand < 1 (for normalized numbers) Note: 0 has no leading 1, so reserve exponent value 0 just for number 0

IEEE 754 Floating Point Standard (2/4) Want FP numbers to be used even if no FP hardware; e.g., sort records with FP numbers using integer compares Could break FP number into 3 parts: compare signs, then compare exponents, then compare significands Then want order: Highest order bit is sign ( negative < positive) Exponent next, so big exponent => bigger # Significand last: exponents same => bigger #

IEEE 754 Floating Point Standard (3/4) Negative Exponent? 2’s comp? 1.0 x 2 -1 v. 1.0 x2 +1 (1/2 v. 2) 01111 000 0000 0000 0000 0000 0000 1/2 00000 0001000 0000 0000 0000 0000 0000 2 This notation using integer compare of 1/2 v. 2 makes 1/2 > 2! Instead, pick notation 0000 0001 is most negative, and 1111 1111 is most positive 1.0 x 2 -1 v. 1.0 x2 +1 (1/2 v. 2) 1/2 00111 1110000 0000 0000 0000 0000 0000 01000 0000000 0000 0000 0000 0000 0000 2

IEEE 754 Floating Point Standard (4/4) Called Biased Notation, where bias is number subtracted to get real number IEEE 754 uses bias of 127 for single prec. Subtract 127 from Exponent field to get actual value for exponent 1023 is bias for double precision Summary (single precision): 0 31 SExponent 302322 Significand 1 bit8 bits23 bits (-1) S x (1 + Significand) x 2 (Exponent-127) Double precision identical, except with exponent bias of 1023

Floating Point numbers approximate values that we want to use. IEEE 754 Floating Point Standard is most widely accepted attempt to standardize interpretation of such numbers Every desktop or server computer sold since ~1997 follows these conventions Summary (single precision): 0 31 SExponent 302322 Significand 1 bit8 bits23 bits (-1) S x (1 + Significand) x 2 (Exponent-127) Double precision identical, bias of 1023

Example: Converting Binary FP to Decimal Sign: 0 => positive Exponent: 0110 1000 two = 104 ten Bias adjustment: 104 - 127 = -23 Significand: 1 + 1x2 -1 + 0x2 -2 + 1x2 -3 + 0x2 -4 + 1x2 -5 +... =1+2 -1 +2 -3 +2 -5 +2 -7 +2 -9 +2 -14 +2 -15 +2 -17 +2 -22 = 1.0 ten + 0.666115 ten 00110 1000101 0101 0100 0011 0100 0010 Represents: 1.666115 ten *2 -23 ~ 1.986*10 -7 (about 2/10,000,000)

Converting Decimal to FP (1/3) Simple Case: If denominator is an exponent of 2 (2, 4, 8, 16, etc.), then it’s easy. Show MIPS representation of -0.75 -0.75 = -3/4 -11 two /100 two = -0.11 two Normalized to -1.1 two x 2 -1 (-1) S x (1 + Significand) x 2 (Exponent-127) (-1) 1 x (1 +.100 0000... 0000) x 2 (126-127) 10111 1110100 0000 0000 0000 0000 0000

Converting Decimal to FP (2/3) Not So Simple Case: If denominator is not an exponent of 2. Then we can’t represent number precisely, but that’s why we have so many bits in significand: for precision Once we have significand, normalizing a number to get the exponent is easy. So how do we get the significand of a neverending number?

Converting Decimal to FP (3/3) Fact: All rational numbers have a repeating pattern when written out in decimal. Fact: This still applies in binary. To finish conversion: Write out binary number with repeating pattern. Cut it off after correct number of bits (different for single v. double precision). Derive Sign, Exponent and Significand fields.

Example: Representing 1/3 in MIPS 1/3 = 0.33333… 10 = 0.25 + 0.0625 + 0.015625 + 0.00390625 + … = 1/4 + 1/16 + 1/64 + 1/256 + … = 2 -2 + 2 -4 + 2 -6 + 2 -8 + … = 0.0101010101… 2 * 2 0 = 1.0101010101… 2 * 2 -2 Sign: 0 Exponent = -2 + 127 = 125 = 01111101 Significand = 0101010101… 00111 11010101 0101 0101 0101 0101 010

Representation for ± ∞ In FP, divide by 0 should produce ± ∞, not overflow. Why? OK to do further computations with ∞ E.g., X/0 > Y may be a valid comparison IEEE 754 represents ± ∞ Most positive exponent reserved for ∞ Significands all zeroes

Representation for 0 Represent 0? exponent all zeroes significand all zeroes too What about sign? +0: 0 00000000 00000000000000000000000 -0: 1 00000000 00000000000000000000000 Why two zeroes? Helps in some limit comparisons

Special Numbers What have we defined so far? (Single Precision) ExponentSignificandObject 000 0nonzero??? 1-254anything+/- fl. pt. # 2550+/- ∞ 255nonzero???

Representation for Not a Number What is sqrt(-4.0) or 0/0 ? If ∞ not an error, these shouldn’t be either. Called Not a Number (NaN) Exponent = 255, Significand nonzero Why is this useful? Hope NaNs help with debugging? They contaminate: op(NaN, X) = NaN

Floating Point Addition Algorithm:

Floating Point Addition Hardware:

Floating Point Multpication Algorithm:

MIPS Floating Point Architecture (1/4) Separate floating point instructions: Single Precision: add.s, sub.s, mul.s, div.s Double Precision: add.d, sub.d, mul.d, div.d These are far more complicated than their integer counterparts Can take much longer to execute

MIPS Floating Point Architecture (2/4) Problems: Inefficient to have different instructions take vastly differing amounts of time. Generally, a particular piece of data will not change FP  int within a program. -Only 1 type of instruction will be used on it. Some programs do no FP calculations It takes lots of hardware relative to integers to do FP fast

MIPS Floating Point Architecture (3/4) 1990 Solution: Make a completely separate chip that handles only FP. Coprocessor 1: FP chip contains 32 32-bit registers: $f0, $f1, … most of the registers specified in.s and.d instruction refer to this set separate load and store: lwc1 and swc1 (“load word coprocessor 1”, “store …”) Double Precision: by convention, even/odd pair contain one DP FP number: $f0 / $f1, $f2 / $f3, …, $f30 / $f31 -Even register is the name

MIPS Floating Point Architecture (4/4) Today, FP coprocessor integrated with CPU, or cheap chips may leave out FP HW Instructions to move data between main processor and coprocessors: mfc0, mtc0, mfc1, mtc1, etc. Appendix contains many more FP ops

Performance evaluation

Performance Metrics  Purchasing perspective l given a collection of machines, which has the -best performance ? -least cost ? -best cost/performance?  Design perspective l faced with design options, which has the -best performance improvement ? -least cost ? -best cost/performance?  Both require l basis for comparison l metric for evaluation

Two Notions of “Performance” Plane Boeing 747 BAD/Sud Concorde Top Speed DC to ParisPassengers Throughput (pass x mph) 610 mph6.5 hours470286,7001350 mph3 hours132178,200 Which has higher performance? Time to deliver 1 passenger? Time to deliver 400 passengers? In a computer, time for 1 job called Response Time or Execution Time In a computer, jobs per day called Throughput or Bandwidth We will focus primarily on execution time for a single job

Performance Metrics  Normally we are interested in reducing Response time (aka execution time) – the time between the start and the completion of a task  Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors  Additional Metrics l Smallest/lightest l Longest battery life l Most reliable/durable (in space)

What is Time?  Straightforward definition of time: l Total time to complete a task, including disk accesses, memory accesses, I/O activities, operating system overhead,... l “real time”, “response time” or “elapsed time”  Alternative: just time processor (CPU) is working only on your program (since multiple processes running at same time) l “CPU execution time” or “CPU time” l Often divided into system CPU time (in OS) and user CPU time (in user program)

Performance Factors  Want to distinguish elapsed time and the time spent on our task  CPU execution time (CPU time) – time the CPU spends working on a task l Does not include time waiting for I/O or running other programs CPU execution time # CPU clock cycles for a program for a program = x clock cycle time CPU execution time # CPU clock cycles for a program for a program clock rate = -------------------------------------------  Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program  Many techniques that decrease the number of clock cycles also increase the clock cycle time or

Review: Machine Clock Rate  Clock rate (MHz, GHz) is inverse of clock cycle time (clock period) CC = 1 / CR one clock period 10 nsec clock cycle => 100 MHz clock rate 5 nsec clock cycle => 200 MHz clock rate 2 nsec clock cycle => 500 MHz clock rate 1 nsec clock cycle => 1 GHz clock rate 500 psec clock cycle => 2 GHz clock rate 250 psec clock cycle => 4 GHz clock rate 200 psec clock cycle => 5 GHz clock rate

Clock Cycles per Instruction  Not all instructions take the same amount of time to execute l One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction  Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute l A way to compare two different implementations of the same ISA # CPU clock cycles # Instructions Average clock cycles for a program for a program per instruction = x CPI for this instruction class ABC CPI123

Effective CPI  Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging Overall effective CPI =  (CPI i x IC i ) i = 1 n l Where IC i is the count (percentage) of the number of instructions of class i executed l CPI i is the (average) number of clock cycles per instruction for that instruction class l n is the number of instruction classes  The overall effective CPI varies by instruction mix – a measure of the dynamic frequency of instructions across one or many programs

THE Performance Equation  Our basic performance equation is then CPU time = Instruction_count x CPI x clock_cycle Instruction_count x CPI clock_rate CPU time = ----------------------------------------------- or  These equations separate the three key factors that affect performance l Can measure the CPU execution time by running the program l The clock rate is usually given l Can measure overall instruction count by using profilers/ simulators without knowing all of the implementation details l CPI varies by instruction type and ISA implementation for which we must know the implementation details instruction count != the number of lines in the code

Determinates of CPU Performance CPU time = Instruction_count x CPI x clock_cycle Instruction_ count CPIclock_cycle Algorithm Programming language Compiler ISA Processor organization Technology

Determinates of CPU Performance CPU time = Instruction_count x CPI x clock_cycle Instruction_ count CPIclock_cycle Algorithm Programming language Compiler ISA Processor organization Technology X XX XX XX X X X X X

A Simple Example  How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?  How does this compare with using branch prediction to shave a cycle off the branch time?  What if two ALU instructions could be executed at once? OpFreqCPI i Freq x CPI i ALU50%1. Load20%5 Store10%3 Branch20%2  =

A Simple Example  How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?  How does this compare with using branch prediction to shave a cycle off the branch time? OpFreqCPI i Freq x CPI i ALU50%1 Load20%5 Store10%3 Branch20%2  =.5 1.0.3.4 2.2 CPU time new = 1.6 x IC x CC so 2.2/1.6 means 37.5% faster 1.6.5.4.3.4.5 1.0.3.2 2.0 CPU time new = 2.0 x IC x CC so 2.2/2.0 means 10% faster

Comparing and Summarizing Performance  Guiding principle in reporting performance measurements is reproducibility – list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))  How do we summarize the performance for benchmark set with a single number? l The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM) AM = 1/n  Time i i = 1 n l Where Time i is the execution time for the i th program of a total of n programs in the workload l A smaller mean indicates a smaller average execution time and thus improved performance

SPEC Benchmarks www.spec.orgwww.spec.org Integer benchmarksFP benchmarks gzipcompressionwupwiseQuantum chromodynamics vprFPGA place & routeswimShallow water model gccGNU C compilermgridMultigrid solver in 3D fields mcfCombinatorial optimizationappluParabolic/elliptic pde craftyChess programmesa3D graphics library parserWord processing programgalgelComputational fluid dynamics eonComputer visualizationartImage recognition (NN) perlbmkperl applicationequakeSeismic wave propagation simulation gapGroup theory interpreterfacerecFacial image recognition vortexObject oriented databaseammpComputational chemistry bzip2compressionlucasPrimality testing twolfCircuit place & routefma3dCrash simulation fem sixtrackNuclear physics accel apsiPollutant distribution

Example SPEC Ratings

Other Performance Metrics  Power consumption – especially in the embedded market where battery life is important (and passive cooling) l For power-limited applications, the most important metric is energy efficiency

Summary: Evaluating ISAs  Design-time metrics: l Can it be implemented, in how long, at what cost? l Can it be programmed? Ease of compilation?  Static Metrics: l How many bytes does the program occupy in memory?  Dynamic Metrics: l How many instructions are executed? How many bytes does the processor fetch to execute the program? l How many clocks are required per instruction? Best Metric: Time to execute the program! CPI Inst. CountCycle Time depends on the instructions set, the processor organization, and compilation techniques.

Example A certain computer has a CPU running at 500 MHz. Your PINE session on this machine ends up executing 200 million instructions, of the following mix: Type CPI Freq A 4 20 % B 3 30 % C 1 40 % D 2 10 % What is the average CPI for this session?

Problem 5 - Performance A certain computer has a CPU running at 500 MHz. Your PINE session on this machine ends up executing 200 million instructions, of the following mix: Type CPI Freq A 4 20 % B 3 30 % C 1 40 % D 2 10 % What is the average CPI for this session? Answer: (20% · 4) + (30% · 3) + (40% · 1) + (10% · 2) = 2.3.

Problem 5 - Performance A certain computer has a CPU running at 500 MHz. Your PINE session on this machine ends up executing 200 million instructions, of the following mix: Type CPI Freq A 4 20 % B 3 30 % C 1 40 % D 2 10 % How much CPU time was used in this session?

Problem 5 - Performance A certain computer has a CPU running at 500 MHz. Your PINE session on this machine ends up executing 200 million instructions, of the following mix: Type CPI Freq A 4 20 % B 3 30 % C 1 40 % D 2 10 % How much CPU time was used in this session? Answer: (200 x 10 6 instructions) · (2 ns/cycle) · (2.3 cycles/instruction) = 0.92 seconds.

Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Similar presentations

Presentation on theme: "Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Similar presentations

Presentation on theme: "Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]"— Presentation transcript:

Similar presentations

About project

Feedback