Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]
ALU Implementation
Review: Basic Hardware
To warm up let's build a logic unit to support the and and or instructions for MIPS (32-bit registers) We'll just build a 1-bit unit and use 32 of them Possible implementation using a multiplexor : A Simple Multi-Function Logic Unit a b output operation selector
Selects one of the inputs to be the output based on a control input Implementation with a Multiplexor units
Not easy to decide the best way to implement do not want too many inputs to a single gate do not want to have to go through too many gates (= levels) Let's look at a 1-bit ALU for addition: Implementations c out = a.b + a.c in + b.c in sum = a.b.c in + a.b.c in + a.b.c in + a.b.c in = a b c in exclusive or (xor)
Building a 1-bit Binary Adder 1 bit Full Adder A B S carry_in carry_out S = A xor B xor carry_in carry_out = A B v A carry_in v B carry_in (majority function) How can we use it to build a 32-bit adder? How can we modify it easily to build an adder/subtractor? ABcarry_incarry_outS
1-bit Adder Logic Half-adder with one xor gate Full-adder from 2 half-adders and an or gate Half-adder with the xor gate replaced by primitive gates using the equation A B = A.B +A.B xor
Building a 32-bit ALU Ripple-Carry Logic for 32-bit ALU 1-bit ALU for AND, OR and add Multiplexor control line
Two's complement approach: just negate b and add. How do we negate? recall negation shortcut : invert each bit of b and set CarryIn to least significant bit (ALU0) to 1 What about Subtraction (a – b) ?
Overflow Detection and Effects Overflow: the result is too large to represent in the number of bits allocated When adding operands with different signs, overflow cannot occur! Overflow occurs when l adding two positives yields a negative l or, adding two negatives gives a positive l or, subtract a negative from a positive gives a negative l or, subtract a positive from a negative gives a positive On overflow, an exception (interrupt) occurs l Control jumps to predefined address for exception l Interrupted address (address of instruction causing the overflow) is saved for possible resumption Don't always want to detect (interrupt on) overflow
Overflow Detection Overflow can be detected by: l Carry into MSB xor Carry out of MSB Logic for overflow detection therefore can be put in to ALU – –4 –
New MIPS Instructions CategoryInstrOp CodeExampleMeaning Arithmetic (R & I format) add unsigned0 and 21addu $s1, $s2, $s3$s1 = $s2 + $s3 sub unsigned0 and 23subu $s1, $s2, $s3$s1 = $s2 - $s3 add imm.unsigned 9addiu $s1, $s2, 6$s1 = $s2 + 6 Data Transfer ld byte unsigned 24lbu $s1, 25($s2)$s1 = Mem($s2+25) ld half unsigned25lhu $s1, 25($s2)$s1 = Mem($s2+25) Cond. Branch (I & R format) set on less than unsigned 0 and 2bsltu $s1, $s2, $s3if ($s2<$s3) $s1=1 else $s1=0 set on less than imm unsigned bsltiu $s1, $s2, 6if ($s2<6) $s1=1 else $s1=0 Sign extend – addiu, addiu, slti, sltiu Zero extend – andi, ori, xori Overflow detected – add, addi, sub
Tailoring the ALU to MIPS: Test for Less-than and Equality Need to support the set-on-less-than instruction e.g., slt $t0, $t3, $t4 remember: slt is an R-type instruction that produces 1 if rs < rt and 0 otherwise idea is to use subtraction: rs < rt rs – rt < 0. Recall msb of negative number is 1 two cases after subtraction rs – rt: if no overflow then rs < rt most significant bit of rs – rt = 1 if overflow then rs < rt most significant bit of rs – rt = 0 why? e.g., 5 ten – 6 ten = 0101 – 0110 = = 1111 (ok!) -7 ten – 6 ten = 1001 – 0110 = = 0011 (overflow!) set bit is sent from ALU31 to ALU0 as the Less bit at ALU0; all other Less bits are hardwired 0; so Less is the 32-bit result of slt
Supporting slt 1- bit ALU for the 31 least significant bits 1-bit ALU for the most significant bit Extra set bit, to be routed to the Less input of the least significant 1-bit ALU, is computed from the most significant Result bit and the Overflow bit (it is not the output of the adder as the figure seems to indicate) Less input of the 31 most significant ALUs is always 0 32-bit ALU from 31 copies of ALU at top left and 1 copy of ALU at bottom left in the most significant position
Tailoring the ALU to MIPS: Test for Less-than and Equality Need to support test for equality e.g., beq $t5, $t6, $t7 use subtraction: rs - rt = 0 rs = rt do we need to consider overflow?
Supporting Test for Equality ALU control lines Bneg- Oper- Func- ate ation tion 0 00 and 0 01 or 0 10 add 1 10 sub 1 11 slt Symbol representing ALU Output is 1 only if all Result bits are 0 Combine CarryIn to least significant ALU and Binvert to a single control line as both are always either 1 or 0 32-bit MIPS ALU
Conclusion We can build an ALU to support the MIPS instruction set key idea: use multiplexor to select the output we want we can efficiently perform subtraction using two’s complement we can replicate a 1-bit ALU to produce a 32-bit ALU Important points about hardware speed of a gate depends number of inputs (fan-in) to the gate speed of a circuit depends on number of gates in series (particularly, on the critical path to the deepest level of logic) Speed of MIPS operations clever changes to organization can improve performance (similar to using better algorithms in software) we’ll look at examples for addition, multiplication and division
Building a 32-bit ALU Ripple-Carry Logic for 32-bit ALU 1-bit ALU for AND, OR and add Multiplexor control line
Is a 32-bit ALU as fast as a 1-bit ALU? Why? Is there more than one way to do addition? Yes: one extreme: ripple-carry – carry ripples through 32 ALUs, slow! other extreme: sum-of-products for each CarryIn bit – super fast! CarryIn bits: c 1 = b 0.c 0 + a 0.c 0 + a 0.b 0 c 2 = b 1.c 1 + a 1.c 1 + a 1.b 1 = a 1.a 0.b 0 + a 1.a 0.c 0 + a 1.b 0.c 0 (substituting for c 1 ) + b 1.a 0.b 0 + b 1.a 0.c 0 + b 1.b 0.c 0 + a 1.b 1 c 3 = b 2.c 2 + a 2.c 2 + a 2.b 2 = … = sum of 15 4-term products… How fast? But not feasible for a 32-bit ALU! Why? Exponential complexity!! Problem: Ripple-carry Adder is Slow Note: c i is CarryIn bit into i th ALU; c 0 is the forced CarryIn into the least significant ALU
An approach between our two extremes Motivation: if we didn't know the value of a carry-in, what could we do? when would we always generate a carry? (generate) g i = a i. b i when would we propagate the carry? (propagate) p i = a i + b i Express (carry-in equations in terms of generate/propagates) c 1 = g 0 + p 0.c 0 c 2 = g 1 + p 1.c 1 = g 1 + p 1.g 0 + p 1.p 0.c 0 c 3 = g 2 + p 2.c 2 = g 2 + p 2.g 1 + p 2.p 1.g 0 + p 2.p 1.p 0.c 0 c 4 = g 3 + p 3.c 3 = g 3 + p 3.g 2 + p 3.p 2.g 1 + p 3.p 2.p 1.g 0 + p 3.p 2.p 1.p 0.c 0 Feasible for 4-bit adders – with wider adders unacceptable complexity. solution: build a first level using 4-bit adders, then a second level on top Two-level Carry-lookahead Adder: First Level
Two-level Carry-lookahead Adder: Second Level for a 16-bit adder Propagate signals for each of the four 4-bit adder blocks: P 0 = p 3.p 2.p 1.p 0 P 1 = p 7.p 6.p 5.p 4 P 2 = p 11.p 10.p 9.p 8 P 3 = p 15.p 14.p 13.p 12 Generate signals for each of the four 4-bit adder blocks: G 0 = g 3 + p 3.g 2 + p 3.p 2.g 1 + p 3.p 2.p 1.g 0 G 1 = g 7 + p 7.g 6 + p 7.p 6.g 5 + p 7.p 6.p 5.g 4 G 2 = g 11 + p 11.g 10 + p 11.p 10.g 9 + p 11.p 10.p 9.g 8 G 3 = g 15 + p 15.g 14 + p 15.p 14.g 13 + p 15.p 14.p 13.g 12
Two-level Carry-lookahead Adder: Second Level for a 16-bit adder CarryIn signals for each of the four 4-bit adder blocks (see earlier carry-in equations in terms of generate/propagates): C 1 = G 0 + P 0.c 0 C 2 = G 1 + P 1.G 0 + P 1.P 0.c 0 C 3 = G 2 + P 2.G 1 + P 2.P 1.G 0 + P 2.P 1.P 0.c 0 C 4 = G 3 + P 3.G 2 + P 3.P 2.G 1 + P 3.P 2.P 1.G 0 + P 3.P 2.P 1.P 0.c 0
CarryIn Result0--3 CarryIn Result4--7 CarryIn Result8--11 CarryIn CarryOut Result CarryIn C1 C2 C3 C4 P0 G0 P1 G1 P2 G2 P3 G3 a0 b0 a1 b1 a2 b2 a3 b3 a4 b4 a5 b5 a6 b6 a7 b7 a8 b8 a9 b9 a10 b10 a11 b11 a12 b12 a13 b13 a14 b14 a15 b15 Logic to compute C1, C2, C3, C4 4bAdder0 4bAdder1 4bAdder2 4bAdder3 16-bit carry-lookahead adder from four 4-bit adders and one carry-lookahead unit Carry-lookahead Logic ALU0 ALU1 ALU2 ALU3 a1 a0 a2 a3 b1 b0 b2 b3 s0 s1 s2 s3 Logic to compute c1, c2, c3, c4, P0, G0 Blow-up of 4-bit adder: (conceptually) consisting of four 1-bit ALUs plus logic to compute all CarryOut bits and one super generate and one super propagate bit. Each 1-bit ALU is exactly as for ripple-carry except c1, c2, c3 for ALUs 1, 2, 3 comes from the extra logic CarryIn Carry-lookahead Unit
Two-level carry-lookahead logic steps: 1. compute p i ’s and g i ’s at each 1-bit ALU 2. compute P i ’s and G i ’s at each 4-bit adder unit 3. compute C i ’s in carry-lookahead unit 4. compute c i ’s at each 4-bit adder unit 5. compute results (sum bits) at each 1-bit ALU Two-level Carry-lookahead Adder: Second Level for a 16-bit adder
Multiplication Standard shift-add method: Multiplicand 1000 Multiplier Product m bits x n bits = m+n bit product Binary makes it easy: multiplier bit 1 => copy multiplicand (1 x multiplicand) multiplier bit 0 => place 0 (0 x multiplicand) 3 versions of multiply hardware & algorithms: x
Shift-add Multiplier Version 1 Algorithm Multiplicand 1000 Multiplier Product
Shift-add Multiplier Version 1 Multiplicand register, product register, ALU are 64-bit wide; multiplier register is 32-bit wide Algorithm 32-bit multiplicand starts at right half of multiplicand register Product register is initialized at 0
Shift-add Multiplier Version1 Itera Step Multiplier Multiplicand Product -tion 0 init values 1 1a … Example: 0010 * 0011: Algorithm
Observations on Multiply Version 1 1 step per clock cycle nearly 100 clock cycles to multiply two 32-bit numbers Half the bits in the multiplicand register always 0 64-bit adder is wasted 0’s inserted to right as multiplicand is shifted left least significant bits of product never change once formed Intuition: instead of shifting multiplicand to left, shift product to right…
Shift-add Multiplier Version 2 Algorithm Multiplicand 1000 Multiplier Product
Shift-add Multiplier Version 2 Multiplicand register, multiplier register, ALU are 32-bit wide; product register is 64-bit wide; multiplicand adds to left half of product register Algorithm Product register is initialized at 0
Shift-add Multiplier Version 2 Itera Step Multiplier Multiplicand Product -tion 0 init values 1 1a … Example: 0010 * 0011: Algorithm
Observations on Multiply Version 2 Each step the product register wastes space that exactly matches the current size of the multiplier Intuition: combine multiplier register and product register…
Shift-add Multiplier Version 3 No separate multiplier register; multiplier placed on right side of 64-bit product register Algorithm Product register is initialized with multiplier on right
Shift-add Multiplier Version 3 Itera Step Multiplicand Product -tion 0 init values 1 1a … Example: 0010 * 0011: Algorithm
Observations on Multiply Version 3 2 steps per bit because multiplier & product combined What about signed multiplication? easiest solution is to make both positive and remember whether to negate product when done, i.e., leave out the sign bit, run for 31 steps, then negate if multiplier and multiplicand have opposite signs Booth’s Algorithm is an elegant way to multiply signed numbers using same hardware – it also often quicker…
MIPS Notes MIPS provides two 32-bit registers Hi and Lo to hold a 64-bit product mult, multu (unsigned) put the product of two 32-bit register operands into Hi and Lo : overflow is ignored by MIPS but can be detected by programmer by examining contents of Hi mflo, mfhi moves content of Hi or Lo to a general-purpose register Pseudo-instructions mul (without overflow), mulo (with overflow), mulou (unsigned with overflow) take three 32-bit register operands, putting the product of two registers into the third
Divide 1001 Quotient Divisor Dividend – – Remainder standart method: see how big a multiple of the divisor can be subtracted, creating quotient digit at each step Binary makes it easy first, try 1 * divisor; if too big, 0 * divisor Dividend = (Quotient * Divisor) + Remainder 3 versions of divide hardware & algorithm:
Divide Version 1 Itera- Step Quotient Divisor Remainder tion 0 init b … Example: 0111 / 0010: Algorithm
Divide Version 1 Divisor register, remainder register, ALU are 64-bit wide; quotient register is 32-bit wide Algorithm 32-bit divisor starts at left half of divisor register Remainder register is initialized with the dividend at right Why 33? Quotient register is initialized to be 0
Divide Version 1
Observations on Divide Version 1 Half the bits in divisor always 0 1/2 of 64-bit adder is wasted 1/2 of divisor register is wasted Intuition: instead of shifting divisor to right, shift remainder to left… Step 1 cannot produce a 1 in quotient bit – as all bits corresponding to the divisor in the remainder register are 0 (remember all operands are 32-bit) Intuition: switch order to shift first and then subtract – can save 1 iteration…
Divide Version 2 Divisor register, quotient register, ALU are 32-bit wide; remainder register is 64-bit wide Remainder register is initialized with the dividend at right Algorithm Why this correction step?
Observations on Divide Version 2 Each step the remainder register wastes space that exactly matches the current size of the quotient Intuition: combine quotient register and remainder register…
Divide Version 3 No separate quotient register; quotient is entered on the right side of the 64-bit remainder register Algorithm Remainder register is initialized with the dividend at right Why this correction step?
Divide Version 3 Example: 0111 / 0010: Itera- Step Divisor Remainder tion 0 init b … 3 4 Algorithm
Observations on Divide Version 3 Same hardware as Multiply Version 3 Signed divide: make both divisor and dividend positive and perform division negate the quotient if divisor and dividend were of opposite signs make the sign of the remainder match that of the dividend this ensures always dividend = (quotient * divisor) + remainder –quotient (x/y) = quotient (–x/y) (e.g. 7 = 3*2 + 1 & –7 = –3*2 – 1)
Divide generates the reminder in hi and the quotient in lo div $s0, $s1 # lo = $s0 / $s1 # hi = $s0 mod $s1 l Instructions mfhi rd and mflo rd are provided to move the quotient and reminder to (user accessible) registers in the register file MIPS Divide Instruction As with multiply, divide ignores overflow so software must determine if the quotient is too large. Software must also check the divisor to avoid division by 0. op rs rt rd shamt funct
MIPS Notes div rs, rt (signed), divu rs, rt (unsigned), with two 32-bit register operands, divide the contents of the operands and put remainder in Hi register and quotient in Lo ; overflow is ignored in both cases pseudo-instructions div rdest, rsrc1, src2 (signed with overflow), divu rdest, rsrc1, src2 (unsigned without overflow) with three 32-bit register operands puts quotients of two registers into third
Floating Point
Review of Numbers Computers are made to deal with numbers What can we represent in N bits? Unsigned integers: 0to2 N - 1 Signed Integers (Two’s Complement) -2 (N-1) to 2 (N-1) - 1
Other Numbers What about other numbers? Very large numbers? (seconds/century) 3,155,760, ( x 10 9 ) Very small numbers? (atomic diameter) ( x ) Rationals (repeating pattern) 2/3 ( ) Irrationals 2 1/2 ( ) Transcendentals e ( ), ( ) All represented in scientific notation
Scientific Notation (in Decimal) x radix (base) decimal pointmantissa exponent Normalized form: no leadings 0s (exactly one digit to left of decimal point) Alternatives to representing 1/1,000,000,000 Normalized: 1.0 x Not normalized: 0.1 x 10 -8,10.0 x
Scientific Notation (in Binary) 1.0 two x 2 -1 radix (base) “binary point” exponent Computer arithmetic that supports it called floating point, because it represents numbers where the binary point is not fixed, as it is for integers Declare such variable in C as float mantissa
Floating Point Representation (1/2) Normal format: +1.xxxxxxxxxx two *2 yyyy two Multiple of Word Size (32 bits) 0 31 SExponent Significand 1 bit8 bits23 bits S represents Sign Exponent represents y’s Significand represents x’s Represent numbers as small as 2.0 x to as large as 2.0 x 10 38
Floating Point Representation (2/2) What if result too large? (> 2.0x10 38 ) Overflow! Overflow Exponent larger than represented in 8-bit Exponent field What if result too small? (>0, < 2.0x ) Underflow! Underflow Negative exponent larger than represented in 8-bit Exponent field How to reduce chances of overflow or underflow?
Double Precision Fl. Pt. Representation Next Multiple of Word Size (64 bits) Double Precision (vs. Single Precision) C variable declared as double Represent numbers almost as small as 2.0 x to almost as large as 2.0 x But primary advantage is greater accuracy due to larger significand 0 31 SExponent Significand 1 bit11 bits20 bits Significand (cont’d) 32 bits
IEEE 754 Floating Point Standard (1/4) Single Precision, DP similar Sign bit:1 means negative 0 means positive Significand: To pack more bits, leading 1 implicit for normalized numbers bits single, bits double always true: Significand < 1 (for normalized numbers) Note: 0 has no leading 1, so reserve exponent value 0 just for number 0
IEEE 754 Floating Point Standard (2/4) Want FP numbers to be used even if no FP hardware; e.g., sort records with FP numbers using integer compares Could break FP number into 3 parts: compare signs, then compare exponents, then compare significands Then want order: Highest order bit is sign ( negative < positive) Exponent next, so big exponent => bigger # Significand last: exponents same => bigger #
IEEE 754 Floating Point Standard (3/4) Negative Exponent? 2’s comp? 1.0 x 2 -1 v. 1.0 x2 +1 (1/2 v. 2) / This notation using integer compare of 1/2 v. 2 makes 1/2 > 2! Instead, pick notation is most negative, and is most positive 1.0 x 2 -1 v. 1.0 x2 +1 (1/2 v. 2) 1/
IEEE 754 Floating Point Standard (4/4) Called Biased Notation, where bias is number subtracted to get real number IEEE 754 uses bias of 127 for single prec. Subtract 127 from Exponent field to get actual value for exponent 1023 is bias for double precision Summary (single precision): 0 31 SExponent Significand 1 bit8 bits23 bits (-1) S x (1 + Significand) x 2 (Exponent-127) Double precision identical, except with exponent bias of 1023
Floating Point numbers approximate values that we want to use. IEEE 754 Floating Point Standard is most widely accepted attempt to standardize interpretation of such numbers Every desktop or server computer sold since ~1997 follows these conventions Summary (single precision): 0 31 SExponent Significand 1 bit8 bits23 bits (-1) S x (1 + Significand) x 2 (Exponent-127) Double precision identical, bias of 1023
Example: Converting Binary FP to Decimal Sign: 0 => positive Exponent: two = 104 ten Bias adjustment: = -23 Significand: 1 + 1x x x x x = = 1.0 ten ten Represents: ten *2 -23 ~ 1.986*10 -7 (about 2/10,000,000)
Converting Decimal to FP (1/3) Simple Case: If denominator is an exponent of 2 (2, 4, 8, 16, etc.), then it’s easy. Show MIPS representation of = -3/4 -11 two /100 two = two Normalized to -1.1 two x 2 -1 (-1) S x (1 + Significand) x 2 (Exponent-127) (-1) 1 x ( ) x 2 ( )
Converting Decimal to FP (2/3) Not So Simple Case: If denominator is not an exponent of 2. Then we can’t represent number precisely, but that’s why we have so many bits in significand: for precision Once we have significand, normalizing a number to get the exponent is easy. So how do we get the significand of a neverending number?
Converting Decimal to FP (3/3) Fact: All rational numbers have a repeating pattern when written out in decimal. Fact: This still applies in binary. To finish conversion: Write out binary number with repeating pattern. Cut it off after correct number of bits (different for single v. double precision). Derive Sign, Exponent and Significand fields.
Example: Representing 1/3 in MIPS 1/3 = … 10 = … = 1/4 + 1/16 + 1/64 + 1/256 + … = … = … 2 * 2 0 = … 2 * 2 -2 Sign: 0 Exponent = = 125 = Significand = …
Representation for ± ∞ In FP, divide by 0 should produce ± ∞, not overflow. Why? OK to do further computations with ∞ E.g., X/0 > Y may be a valid comparison IEEE 754 represents ± ∞ Most positive exponent reserved for ∞ Significands all zeroes
Representation for 0 Represent 0? exponent all zeroes significand all zeroes too What about sign? +0: : Why two zeroes? Helps in some limit comparisons
Special Numbers What have we defined so far? (Single Precision) ExponentSignificandObject 000 0nonzero??? 1-254anything+/- fl. pt. # 2550+/- ∞ 255nonzero???
Representation for Not a Number What is sqrt(-4.0) or 0/0 ? If ∞ not an error, these shouldn’t be either. Called Not a Number (NaN) Exponent = 255, Significand nonzero Why is this useful? Hope NaNs help with debugging? They contaminate: op(NaN, X) = NaN
Floating Point Addition Algorithm:
Floating Point Addition Hardware:
Floating Point Multpication Algorithm:
MIPS Floating Point Architecture (1/4) Separate floating point instructions: Single Precision: add.s, sub.s, mul.s, div.s Double Precision: add.d, sub.d, mul.d, div.d These are far more complicated than their integer counterparts Can take much longer to execute
MIPS Floating Point Architecture (2/4) Problems: Inefficient to have different instructions take vastly differing amounts of time. Generally, a particular piece of data will not change FP int within a program. -Only 1 type of instruction will be used on it. Some programs do no FP calculations It takes lots of hardware relative to integers to do FP fast
MIPS Floating Point Architecture (3/4) 1990 Solution: Make a completely separate chip that handles only FP. Coprocessor 1: FP chip contains bit registers: $f0, $f1, … most of the registers specified in.s and.d instruction refer to this set separate load and store: lwc1 and swc1 (“load word coprocessor 1”, “store …”) Double Precision: by convention, even/odd pair contain one DP FP number: $f0 / $f1, $f2 / $f3, …, $f30 / $f31 -Even register is the name
MIPS Floating Point Architecture (4/4) Today, FP coprocessor integrated with CPU, or cheap chips may leave out FP HW Instructions to move data between main processor and coprocessors: mfc0, mtc0, mfc1, mtc1, etc. Appendix contains many more FP ops
Performance evaluation
Performance Metrics Purchasing perspective l given a collection of machines, which has the -best performance ? -least cost ? -best cost/performance? Design perspective l faced with design options, which has the -best performance improvement ? -least cost ? -best cost/performance? Both require l basis for comparison l metric for evaluation
Two Notions of “Performance” Plane Boeing 747 BAD/Sud Concorde Top Speed DC to ParisPassengers Throughput (pass x mph) 610 mph6.5 hours470286, mph3 hours132178,200 Which has higher performance? Time to deliver 1 passenger? Time to deliver 400 passengers? In a computer, time for 1 job called Response Time or Execution Time In a computer, jobs per day called Throughput or Bandwidth We will focus primarily on execution time for a single job
Performance Metrics Normally we are interested in reducing Response time (aka execution time) – the time between the start and the completion of a task Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors Additional Metrics l Smallest/lightest l Longest battery life l Most reliable/durable (in space)
What is Time? Straightforward definition of time: l Total time to complete a task, including disk accesses, memory accesses, I/O activities, operating system overhead,... l “real time”, “response time” or “elapsed time” Alternative: just time processor (CPU) is working only on your program (since multiple processes running at same time) l “CPU execution time” or “CPU time” l Often divided into system CPU time (in OS) and user CPU time (in user program)
Performance Factors Want to distinguish elapsed time and the time spent on our task CPU execution time (CPU time) – time the CPU spends working on a task l Does not include time waiting for I/O or running other programs CPU execution time # CPU clock cycles for a program for a program = x clock cycle time CPU execution time # CPU clock cycles for a program for a program clock rate = Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program Many techniques that decrease the number of clock cycles also increase the clock cycle time or
Review: Machine Clock Rate Clock rate (MHz, GHz) is inverse of clock cycle time (clock period) CC = 1 / CR one clock period 10 nsec clock cycle => 100 MHz clock rate 5 nsec clock cycle => 200 MHz clock rate 2 nsec clock cycle => 500 MHz clock rate 1 nsec clock cycle => 1 GHz clock rate 500 psec clock cycle => 2 GHz clock rate 250 psec clock cycle => 4 GHz clock rate 200 psec clock cycle => 5 GHz clock rate
Clock Cycles per Instruction Not all instructions take the same amount of time to execute l One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute l A way to compare two different implementations of the same ISA # CPU clock cycles # Instructions Average clock cycles for a program for a program per instruction = x CPI for this instruction class ABC CPI123
Effective CPI Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging Overall effective CPI = (CPI i x IC i ) i = 1 n l Where IC i is the count (percentage) of the number of instructions of class i executed l CPI i is the (average) number of clock cycles per instruction for that instruction class l n is the number of instruction classes The overall effective CPI varies by instruction mix – a measure of the dynamic frequency of instructions across one or many programs
THE Performance Equation Our basic performance equation is then CPU time = Instruction_count x CPI x clock_cycle Instruction_count x CPI clock_rate CPU time = or These equations separate the three key factors that affect performance l Can measure the CPU execution time by running the program l The clock rate is usually given l Can measure overall instruction count by using profilers/ simulators without knowing all of the implementation details l CPI varies by instruction type and ISA implementation for which we must know the implementation details instruction count != the number of lines in the code
Determinates of CPU Performance CPU time = Instruction_count x CPI x clock_cycle Instruction_ count CPIclock_cycle Algorithm Programming language Compiler ISA Processor organization Technology
Determinates of CPU Performance CPU time = Instruction_count x CPI x clock_cycle Instruction_ count CPIclock_cycle Algorithm Programming language Compiler ISA Processor organization Technology X XX XX XX X X X X X
A Simple Example How much faster would the machine be if a better data cache reduced the average load time to 2 cycles? How does this compare with using branch prediction to shave a cycle off the branch time? What if two ALU instructions could be executed at once? OpFreqCPI i Freq x CPI i ALU50%1. Load20%5 Store10%3 Branch20%2 =
A Simple Example How much faster would the machine be if a better data cache reduced the average load time to 2 cycles? How does this compare with using branch prediction to shave a cycle off the branch time? OpFreqCPI i Freq x CPI i ALU50%1 Load20%5 Store10%3 Branch20%2 = CPU time new = 1.6 x IC x CC so 2.2/1.6 means 37.5% faster CPU time new = 2.0 x IC x CC so 2.2/2.0 means 10% faster
Comparing and Summarizing Performance Guiding principle in reporting performance measurements is reproducibility – list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.)) How do we summarize the performance for benchmark set with a single number? l The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM) AM = 1/n Time i i = 1 n l Where Time i is the execution time for the i th program of a total of n programs in the workload l A smaller mean indicates a smaller average execution time and thus improved performance
SPEC Benchmarks Integer benchmarksFP benchmarks gzipcompressionwupwiseQuantum chromodynamics vprFPGA place & routeswimShallow water model gccGNU C compilermgridMultigrid solver in 3D fields mcfCombinatorial optimizationappluParabolic/elliptic pde craftyChess programmesa3D graphics library parserWord processing programgalgelComputational fluid dynamics eonComputer visualizationartImage recognition (NN) perlbmkperl applicationequakeSeismic wave propagation simulation gapGroup theory interpreterfacerecFacial image recognition vortexObject oriented databaseammpComputational chemistry bzip2compressionlucasPrimality testing twolfCircuit place & routefma3dCrash simulation fem sixtrackNuclear physics accel apsiPollutant distribution
Example SPEC Ratings
Other Performance Metrics Power consumption – especially in the embedded market where battery life is important (and passive cooling) l For power-limited applications, the most important metric is energy efficiency
Summary: Evaluating ISAs Design-time metrics: l Can it be implemented, in how long, at what cost? l Can it be programmed? Ease of compilation? Static Metrics: l How many bytes does the program occupy in memory? Dynamic Metrics: l How many instructions are executed? How many bytes does the processor fetch to execute the program? l How many clocks are required per instruction? Best Metric: Time to execute the program! CPI Inst. CountCycle Time depends on the instructions set, the processor organization, and compilation techniques.
Example A certain computer has a CPU running at 500 MHz. Your PINE session on this machine ends up executing 200 million instructions, of the following mix: Type CPI Freq A 4 20 % B 3 30 % C 1 40 % D 2 10 % What is the average CPI for this session?
Problem 5 - Performance A certain computer has a CPU running at 500 MHz. Your PINE session on this machine ends up executing 200 million instructions, of the following mix: Type CPI Freq A 4 20 % B 3 30 % C 1 40 % D 2 10 % What is the average CPI for this session? Answer: (20% · 4) + (30% · 3) + (40% · 1) + (10% · 2) = 2.3.
Problem 5 - Performance A certain computer has a CPU running at 500 MHz. Your PINE session on this machine ends up executing 200 million instructions, of the following mix: Type CPI Freq A 4 20 % B 3 30 % C 1 40 % D 2 10 % How much CPU time was used in this session?
Problem 5 - Performance A certain computer has a CPU running at 500 MHz. Your PINE session on this machine ends up executing 200 million instructions, of the following mix: Type CPI Freq A 4 20 % B 3 30 % C 1 40 % D 2 10 % How much CPU time was used in this session? Answer: (200 x 10 6 instructions) · (2 ns/cycle) · (2.3 cycles/instruction) = 0.92 seconds.