Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

Slides:

Advertisements

Similar presentations

ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )

Advertisements

Assessing and Understanding Performance Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from slides kindly made available.

810:142 Lecture 2: Performance Fall 2006 Chapter 4: Performance Adapted from Mary Jane Irwin at Penn State University for Computer Organization and Design,

ECE-C355 Computer Structures Winter 2008 Chapter 04: Understanding Performance Slides are adapted from Professor Mary Jane Irwin (

CSE431 L04 Understanding Performance.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 04: Understanding Performance Mary Jane Irwin (

1 CONSTRUCTING AN ARITHMETIC LOGIC UNIT CHAPTER 4: PART II.

1 CS61C L9 Fl. Pt. © UC Regents CS61C - Machine Structures Lecture 9 - Floating Point, Part I September 27, 2000 David Patterson

Integer Multiplication (1/3)

Chapter 3 Arithmetic for Computers. Multiplication More complicated than addition accomplished via shifting and addition More time and more area Let's.

Lecture 9 Sept 28 Chapter 3 Arithmetic for Computers.

1  2004 Morgan Kaufmann Publishers Chapter Three.

1 Chapter 4: Arithmetic Where we've been: –Performance (seconds, cycles, instructions) –Abstractions: Instruction Set Architecture Assembly Language and.

CS 61C L15 Floating Point I (1) Krause, Spring 2005 © UCB This day in history… TA Danny Krause inst.eecs.berkeley.edu/~cs61c-td inst.eecs.berkeley.edu/~cs61c.

1 Lecture 8: Binary Multiplication & Division Today’s topics:  Addition/Subtraction  Multiplication  Division Reminder: get started early on assignment.

ECE 15B Computer Organization Spring 2010 Dmitri Strukov Lecture 6: Logic/Shift Instructions Partially adapted from Computer Organization and Design, 4.

1 ECE369 Chapter 3. 2 ECE369 Multiplication More complicated than addition –Accomplished via shifting and addition More time and more area.

CS 61C L10 Floating Point (1) A Carle, Summer 2005 © UCB inst.eecs.berkeley.edu/~cs61c/su05 CS61C : Machine Structures Lecture #10: Floating Point

1  1998 Morgan Kaufmann Publishers Chapter Four Arithmetic for Computers.

Chapter 3 Arithmetic for Computers. Arithmetic Where we've been: Abstractions: Instruction Set Architecture Assembly Language and Machine Language What's.

Arithmetic for Computers

IT 251 Computer Organization and Architecture Introduction to Floating Point Numbers Chia-Chi Teng.

1 Bits are just bits (no inherent meaning) — conventions define relationship between bits and numbers Binary numbers (base 2)

Computer Arithmetic Nizamettin AYDIN

Computer Arithmetic. Instruction Formats Layout of bits in an instruction Includes opcode Includes (implicit or explicit) operand(s) Usually more than.

CSE 340 Computer Architecture Summer 2014 Understanding Performance

Csci 136 Computer Architecture II – Floating-Point Arithmetic

CEN 316 Computer Organization and Design Computer Arithmetic Floating Point Dr. Mansour AL Zuair.

IT253: Computer Organization

Computing Systems Basic arithmetic for computers.

Computer Organization and Architecture (AT70.01) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: Patterson.

07/19/2005 Arithmetic / Logic Unit – ALU Design Presentation F CSE : Introduction to Computer Architecture Slides by Gojko Babić.

CH09 Computer Arithmetic  CPU combines of ALU and Control Unit, this chapter discusses ALU The Arithmetic and Logic Unit (ALU) Number Systems Integer.

1 EGRE 426 Fall 08 Chapter Three. 2 Arithmetic What's up ahead: –Implementing the Architecture 32 operation result a b ALU.

1  1998 Morgan Kaufmann Publishers Arithmetic Where we've been: –Performance (seconds, cycles, instructions) –Abstractions: Instruction Set Architecture.

Csci 136 Computer Architecture II – Constructing An Arithmetic Logic Unit Xiuzhen Cheng

Computing Systems Designing a basic ALU.

1 Lecture 6 BOOLEAN ALGEBRA and GATES Building a 32 bit processor PH 3: B.1-B.5.

CDA 3101 Fall 2013 Introduction to Computer Organization The Arithmetic Logic Unit (ALU) and MIPS ALU Support 20 September 2013.

FLOATING POINT ARITHMETIC P&H Slides ECE 411 – Spring 2005.

CS 61C L3.2.1 Floating Point 1 (1) K. Meinz, Summer 2004 © UCB CS61C : Machine Structures Lecture Floating Point Kurt Meinz inst.eecs.berkeley.edu/~cs61c.

CHAPTER 3 Floating Point.

CDA 3101 Fall 2013 Introduction to Computer Organization

1  2004 Morgan Kaufmann Publishers Performance is specific to a particular program/s –Total execution time is a consistent summary of performance For.

1 ELEN 033 Lecture 4 Chapter 4 of Text (COD2E) Chapters 3 and 4 of Goodman and Miller book.

CS35101 – Computer Architecture Week 9: Understanding Performance Paul Durand ( ) [Adapted from M Irwin (

1  2004 Morgan Kaufmann Publishers Lets Build a Processor Almost ready to move into chapter 5 and start building a processor First, let’s review Boolean.

Computer Arthmetic Chapter Four P&H. Data Representation Why do we not encode numbers as strings of ASCII digits inside computers? What is overflow when.

9/23/2004Comp 120 Fall September Chapter 4 – Arithmetic and its implementation Assignments 5,6 and 7 posted to the class web page.

EE204 L03-ALUHina Anwar Khan EE204 Computer Architecture Lecture 03- ALU.

By Wannarat Computer System Design Lecture 3 Wannarat Suntiamorntut.

1 CPTR 220 Computer Organization Computer Architecture Assembly Programming.

CSE 340 Computer Architecture Summer 2016 Understanding Performance.

1 Chapter 3 Arithmetic for Computers Lecture Slides are from Prof. Jose Delgado-Frias, Mr. Paul Wettin, and Prof. Valeriu Beiu (Washington State University.

Computer System Design Lecture 3

Computer Arthmetic Chapter Four P&H.

Assessing and Understanding Performance

April 2006 Saeid Nooshabadi

Floating Point Arithmetics

CS 61C: Great Ideas in Computer Architecture Floating Point Arithmetic

Computer Arithmetic Multiplication, Floating Point

ECEG-3202 Computer Architecture and Organization

Floating Point Numbers

Morgan Kaufmann Publishers Arithmetic for Computers

CDA 3101 Spring 2016 Introduction to Computer Organization

Presentation transcript:

Computer Architecture and Design – ECEN 350 Part 7 [Some slides adapted from M. Irwin, D. Paterson and others]

ALU Implementation

Review: Basic Hardware

To warm up let's build a logic unit to support the and and or instructions for MIPS (32-bit registers) We'll just build a 1-bit unit and use 32 of them Possible implementation using a multiplexor : A Simple Multi-Function Logic Unit a b output operation selector

Selects one of the inputs to be the output based on a control input Implementation with a Multiplexor units

Not easy to decide the best way to implement do not want too many inputs to a single gate do not want to have to go through too many gates (= levels) Let's look at a 1-bit ALU for addition: Implementations c out = a.b + a.c in + b.c in sum = a.b.c in + a.b.c in + a.b.c in + a.b.c in = a  b  c in exclusive or (xor)

Building a 1-bit Binary Adder 1 bit Full Adder A B S carry_in carry_out S = A xor B xor carry_in carry_out = A  B v A  carry_in v B  carry_in (majority function)  How can we use it to build a 32-bit adder?  How can we modify it easily to build an adder/subtractor? ABcarry_incarry_outS

1-bit Adder Logic Half-adder with one xor gate Full-adder from 2 half-adders and an or gate Half-adder with the xor gate replaced by primitive gates using the equation A  B = A.B +A.B xor

Building a 32-bit ALU Ripple-Carry Logic for 32-bit ALU 1-bit ALU for AND, OR and add Multiplexor control line

Two's complement approach: just negate b and add. How do we negate? recall negation shortcut : invert each bit of b and set CarryIn to least significant bit (ALU0) to 1 What about Subtraction (a – b) ?

Overflow Detection and Effects  Overflow: the result is too large to represent in the number of bits allocated  When adding operands with different signs, overflow cannot occur! Overflow occurs when l adding two positives yields a negative l or, adding two negatives gives a positive l or, subtract a negative from a positive gives a negative l or, subtract a positive from a negative gives a positive  On overflow, an exception (interrupt) occurs l Control jumps to predefined address for exception l Interrupted address (address of instruction causing the overflow) is saved for possible resumption  Don't always want to detect (interrupt on) overflow

Overflow Detection  Overflow can be detected by: l Carry into MSB xor Carry out of MSB  Logic for overflow detection therefore can be put in to ALU – –4 –

New MIPS Instructions CategoryInstrOp CodeExampleMeaning Arithmetic (R & I format) add unsigned0 and 21addu $s1, $s2, $s3$s1 = $s2 + $s3 sub unsigned0 and 23subu $s1, $s2, $s3$s1 = $s2 - $s3 add imm.unsigned 9addiu $s1, $s2, 6$s1 = $s2 + 6 Data Transfer ld byte unsigned 24lbu $s1, 25($s2)$s1 = Mem($s2+25) ld half unsigned25lhu $s1, 25($s2)$s1 = Mem($s2+25) Cond. Branch (I & R format) set on less than unsigned 0 and 2bsltu $s1, $s2, $s3if ($s2<$s3) $s1=1 else $s1=0 set on less than imm unsigned bsltiu $s1, $s2, 6if ($s2<6) $s1=1 else $s1=0  Sign extend – addiu, addiu, slti, sltiu  Zero extend – andi, ori, xori  Overflow detected – add, addi, sub

Tailoring the ALU to MIPS: Test for Less-than and Equality Need to support the set-on-less-than instruction e.g., slt $t0, $t3, $t4 remember: slt is an R-type instruction that produces 1 if rs < rt and 0 otherwise idea is to use subtraction: rs < rt  rs – rt < 0. Recall msb of negative number is 1 two cases after subtraction rs – rt: if no overflow then rs < rt  most significant bit of rs – rt = 1 if overflow then rs < rt  most significant bit of rs – rt = 0 why? e.g., 5 ten – 6 ten = 0101 – 0110 = = 1111 (ok!) -7 ten – 6 ten = 1001 – 0110 = = 0011 (overflow!) set bit is sent from ALU31 to ALU0 as the Less bit at ALU0; all other Less bits are hardwired 0; so Less is the 32-bit result of slt

Supporting slt 1- bit ALU for the 31 least significant bits 1-bit ALU for the most significant bit Extra set bit, to be routed to the Less input of the least significant 1-bit ALU, is computed from the most significant Result bit and the Overflow bit (it is not the output of the adder as the figure seems to indicate) Less input of the 31 most significant ALUs is always 0 32-bit ALU from 31 copies of ALU at top left and 1 copy of ALU at bottom left in the most significant position

Tailoring the ALU to MIPS: Test for Less-than and Equality Need to support test for equality e.g., beq $t5, $t6, $t7 use subtraction: rs - rt = 0  rs = rt do we need to consider overflow?

Supporting Test for Equality ALU control lines Bneg- Oper- Func- ate ation tion 0 00 and 0 01 or 0 10 add 1 10 sub 1 11 slt Symbol representing ALU Output is 1 only if all Result bits are 0 Combine CarryIn to least significant ALU and Binvert to a single control line as both are always either 1 or 0 32-bit MIPS ALU

Conclusion We can build an ALU to support the MIPS instruction set key idea: use multiplexor to select the output we want we can efficiently perform subtraction using two’s complement we can replicate a 1-bit ALU to produce a 32-bit ALU Important points about hardware speed of a gate depends number of inputs (fan-in) to the gate speed of a circuit depends on number of gates in series (particularly, on the critical path to the deepest level of logic) Speed of MIPS operations clever changes to organization can improve performance (similar to using better algorithms in software) we’ll look at examples for addition, multiplication and division

Building a 32-bit ALU Ripple-Carry Logic for 32-bit ALU 1-bit ALU for AND, OR and add Multiplexor control line

Is a 32-bit ALU as fast as a 1-bit ALU? Why? Is there more than one way to do addition? Yes: one extreme: ripple-carry – carry ripples through 32 ALUs, slow! other extreme: sum-of-products for each CarryIn bit – super fast! CarryIn bits: c 1 = b 0.c 0 + a 0.c 0 + a 0.b 0 c 2 = b 1.c 1 + a 1.c 1 + a 1.b 1 = a 1.a 0.b 0 + a 1.a 0.c 0 + a 1.b 0.c 0 (substituting for c 1 ) + b 1.a 0.b 0 + b 1.a 0.c 0 + b 1.b 0.c 0 + a 1.b 1 c 3 = b 2.c 2 + a 2.c 2 + a 2.b 2 = … = sum of 15 4-term products… How fast? But not feasible for a 32-bit ALU! Why? Exponential complexity!! Problem: Ripple-carry Adder is Slow Note: c i is CarryIn bit into i th ALU; c 0 is the forced CarryIn into the least significant ALU

An approach between our two extremes Motivation: if we didn't know the value of a carry-in, what could we do? when would we always generate a carry? (generate) g i = a i. b i when would we propagate the carry? (propagate) p i = a i + b i Express (carry-in equations in terms of generate/propagates) c 1 = g 0 + p 0.c 0 c 2 = g 1 + p 1.c 1 = g 1 + p 1.g 0 + p 1.p 0.c 0 c 3 = g 2 + p 2.c 2 = g 2 + p 2.g 1 + p 2.p 1.g 0 + p 2.p 1.p 0.c 0 c 4 = g 3 + p 3.c 3 = g 3 + p 3.g 2 + p 3.p 2.g 1 + p 3.p 2.p 1.g 0 + p 3.p 2.p 1.p 0.c 0 Feasible for 4-bit adders – with wider adders unacceptable complexity. solution: build a first level using 4-bit adders, then a second level on top Two-level Carry-lookahead Adder: First Level

Two-level Carry-lookahead Adder: Second Level for a 16-bit adder Propagate signals for each of the four 4-bit adder blocks: P 0 = p 3.p 2.p 1.p 0 P 1 = p 7.p 6.p 5.p 4 P 2 = p 11.p 10.p 9.p 8 P 3 = p 15.p 14.p 13.p 12 Generate signals for each of the four 4-bit adder blocks: G 0 = g 3 + p 3.g 2 + p 3.p 2.g 1 + p 3.p 2.p 1.g 0 G 1 = g 7 + p 7.g 6 + p 7.p 6.g 5 + p 7.p 6.p 5.g 4 G 2 = g 11 + p 11.g 10 + p 11.p 10.g 9 + p 11.p 10.p 9.g 8 G 3 = g 15 + p 15.g 14 + p 15.p 14.g 13 + p 15.p 14.p 13.g 12

Two-level Carry-lookahead Adder: Second Level for a 16-bit adder CarryIn signals for each of the four 4-bit adder blocks (see earlier carry-in equations in terms of generate/propagates): C 1 = G 0 + P 0.c 0 C 2 = G 1 + P 1.G 0 + P 1.P 0.c 0 C 3 = G 2 + P 2.G 1 + P 2.P 1.G 0 + P 2.P 1.P 0.c 0 C 4 = G 3 + P 3.G 2 + P 3.P 2.G 1 + P 3.P 2.P 1.G 0 + P 3.P 2.P 1.P 0.c 0

CarryIn Result0--3 CarryIn Result4--7 CarryIn Result8--11 CarryIn CarryOut Result CarryIn C1 C2 C3 C4 P0 G0 P1 G1 P2 G2 P3 G3 a0 b0 a1 b1 a2 b2 a3 b3 a4 b4 a5 b5 a6 b6 a7 b7 a8 b8 a9 b9 a10 b10 a11 b11 a12 b12 a13 b13 a14 b14 a15 b15 Logic to compute C1, C2, C3, C4 4bAdder0 4bAdder1 4bAdder2 4bAdder3 16-bit carry-lookahead adder from four 4-bit adders and one carry-lookahead unit Carry-lookahead Logic ALU0 ALU1 ALU2 ALU3 a1 a0 a2 a3 b1 b0 b2 b3 s0 s1 s2 s3 Logic to compute c1, c2, c3, c4, P0, G0 Blow-up of 4-bit adder: (conceptually) consisting of four 1-bit ALUs plus logic to compute all CarryOut bits and one super generate and one super propagate bit. Each 1-bit ALU is exactly as for ripple-carry except c1, c2, c3 for ALUs 1, 2, 3 comes from the extra logic CarryIn Carry-lookahead Unit

Two-level carry-lookahead logic steps: 1. compute p i ’s and g i ’s at each 1-bit ALU 2. compute P i ’s and G i ’s at each 4-bit adder unit 3. compute C i ’s in carry-lookahead unit 4. compute c i ’s at each 4-bit adder unit 5. compute results (sum bits) at each 1-bit ALU Two-level Carry-lookahead Adder: Second Level for a 16-bit adder

Multiplication Standard shift-add method: Multiplicand 1000 Multiplier Product m bits x n bits = m+n bit product Binary makes it easy: multiplier bit 1 => copy multiplicand (1 x multiplicand) multiplier bit 0 => place 0 (0 x multiplicand) 3 versions of multiply hardware & algorithms: x

Shift-add Multiplier Version 1 Algorithm Multiplicand 1000 Multiplier Product

Shift-add Multiplier Version 1 Multiplicand register, product register, ALU are 64-bit wide; multiplier register is 32-bit wide Algorithm 32-bit multiplicand starts at right half of multiplicand register Product register is initialized at 0

Shift-add Multiplier Version1 Itera Step Multiplier Multiplicand Product -tion 0 init values 1 1a … Example: 0010 * 0011: Algorithm

Observations on Multiply Version 1 1 step per clock cycle  nearly 100 clock cycles to multiply two 32-bit numbers Half the bits in the multiplicand register always 0  64-bit adder is wasted 0’s inserted to right as multiplicand is shifted left  least significant bits of product never change once formed Intuition: instead of shifting multiplicand to left, shift product to right…

Shift-add Multiplier Version 2 Algorithm Multiplicand 1000 Multiplier Product

Shift-add Multiplier Version 2 Multiplicand register, multiplier register, ALU are 32-bit wide; product register is 64-bit wide; multiplicand adds to left half of product register Algorithm Product register is initialized at 0

Shift-add Multiplier Version 2 Itera Step Multiplier Multiplicand Product -tion 0 init values 1 1a … Example: 0010 * 0011: Algorithm

Observations on Multiply Version 2 Each step the product register wastes space that exactly matches the current size of the multiplier Intuition: combine multiplier register and product register…

Shift-add Multiplier Version 3 No separate multiplier register; multiplier placed on right side of 64-bit product register Algorithm Product register is initialized with multiplier on right

Shift-add Multiplier Version 3 Itera Step Multiplicand Product -tion 0 init values 1 1a … Example: 0010 * 0011: Algorithm

Observations on Multiply Version 3 2 steps per bit because multiplier & product combined What about signed multiplication? easiest solution is to make both positive and remember whether to negate product when done, i.e., leave out the sign bit, run for 31 steps, then negate if multiplier and multiplicand have opposite signs Booth’s Algorithm is an elegant way to multiply signed numbers using same hardware – it also often quicker…

MIPS Notes MIPS provides two 32-bit registers Hi and Lo to hold a 64-bit product mult, multu (unsigned) put the product of two 32-bit register operands into Hi and Lo : overflow is ignored by MIPS but can be detected by programmer by examining contents of Hi mflo, mfhi moves content of Hi or Lo to a general-purpose register Pseudo-instructions mul (without overflow), mulo (with overflow), mulou (unsigned with overflow) take three 32-bit register operands, putting the product of two registers into the third

Divide 1001 Quotient Divisor Dividend – – Remainder standart method: see how big a multiple of the divisor can be subtracted, creating quotient digit at each step Binary makes it easy  first, try 1 * divisor; if too big, 0 * divisor Dividend = (Quotient * Divisor) + Remainder 3 versions of divide hardware & algorithm:

Divide Version 1 Itera- Step Quotient Divisor Remainder tion 0 init b … Example: 0111 / 0010: Algorithm

Divide Version 1 Divisor register, remainder register, ALU are 64-bit wide; quotient register is 32-bit wide Algorithm 32-bit divisor starts at left half of divisor register Remainder register is initialized with the dividend at right Why 33? Quotient register is initialized to be 0

Divide Version 1

Observations on Divide Version 1 Half the bits in divisor always 0  1/2 of 64-bit adder is wasted  1/2 of divisor register is wasted Intuition: instead of shifting divisor to right, shift remainder to left… Step 1 cannot produce a 1 in quotient bit – as all bits corresponding to the divisor in the remainder register are 0 (remember all operands are 32-bit) Intuition: switch order to shift first and then subtract – can save 1 iteration…

Divide Version 2 Divisor register, quotient register, ALU are 32-bit wide; remainder register is 64-bit wide Remainder register is initialized with the dividend at right Algorithm Why this correction step?

Observations on Divide Version 2 Each step the remainder register wastes space that exactly matches the current size of the quotient Intuition: combine quotient register and remainder register…

Divide Version 3 No separate quotient register; quotient is entered on the right side of the 64-bit remainder register Algorithm Remainder register is initialized with the dividend at right Why this correction step?

Divide Version 3 Example: 0111 / 0010: Itera- Step Divisor Remainder tion 0 init b … 3 4 Algorithm

Observations on Divide Version 3 Same hardware as Multiply Version 3 Signed divide: make both divisor and dividend positive and perform division negate the quotient if divisor and dividend were of opposite signs make the sign of the remainder match that of the dividend this ensures always dividend = (quotient * divisor) + remainder –quotient (x/y) = quotient (–x/y) (e.g. 7 = 3*2 + 1 & –7 = –3*2 – 1)

 Divide generates the reminder in hi and the quotient in lo div $s0, $s1 # lo = $s0 / $s1 # hi = $s0 mod $s1 l Instructions mfhi rd and mflo rd are provided to move the quotient and reminder to (user accessible) registers in the register file MIPS Divide Instruction  As with multiply, divide ignores overflow so software must determine if the quotient is too large. Software must also check the divisor to avoid division by 0. op rs rt rd shamt funct

MIPS Notes div rs, rt (signed), divu rs, rt (unsigned), with two 32-bit register operands, divide the contents of the operands and put remainder in Hi register and quotient in Lo ; overflow is ignored in both cases pseudo-instructions div rdest, rsrc1, src2 (signed with overflow), divu rdest, rsrc1, src2 (unsigned without overflow) with three 32-bit register operands puts quotients of two registers into third

Floating Point

Review of Numbers Computers are made to deal with numbers What can we represent in N bits? Unsigned integers: 0to2 N - 1 Signed Integers (Two’s Complement) -2 (N-1) to 2 (N-1) - 1

Other Numbers What about other numbers? Very large numbers? (seconds/century) 3,155,760, ( x 10 9 ) Very small numbers? (atomic diameter) ( x ) Rationals (repeating pattern) 2/3 ( ) Irrationals 2 1/2 ( ) Transcendentals e ( ),  ( ) All represented in scientific notation

Scientific Notation (in Decimal) x radix (base) decimal pointmantissa exponent Normalized form: no leadings 0s (exactly one digit to left of decimal point) Alternatives to representing 1/1,000,000,000 Normalized: 1.0 x Not normalized: 0.1 x 10 -8,10.0 x

Scientific Notation (in Binary) 1.0 two x 2 -1 radix (base) “binary point” exponent Computer arithmetic that supports it called floating point, because it represents numbers where the binary point is not fixed, as it is for integers Declare such variable in C as float mantissa

Floating Point Representation (1/2) Normal format: +1.xxxxxxxxxx two *2 yyyy two Multiple of Word Size (32 bits) 0 31 SExponent Significand 1 bit8 bits23 bits S represents Sign Exponent represents y’s Significand represents x’s Represent numbers as small as 2.0 x to as large as 2.0 x 10 38

Floating Point Representation (2/2) What if result too large? (> 2.0x10 38 ) Overflow! Overflow  Exponent larger than represented in 8-bit Exponent field What if result too small? (>0, < 2.0x ) Underflow! Underflow  Negative exponent larger than represented in 8-bit Exponent field How to reduce chances of overflow or underflow?

Double Precision Fl. Pt. Representation Next Multiple of Word Size (64 bits) Double Precision (vs. Single Precision) C variable declared as double Represent numbers almost as small as 2.0 x to almost as large as 2.0 x But primary advantage is greater accuracy due to larger significand 0 31 SExponent Significand 1 bit11 bits20 bits Significand (cont’d) 32 bits

IEEE 754 Floating Point Standard (1/4) Single Precision, DP similar Sign bit:1 means negative 0 means positive Significand: To pack more bits, leading 1 implicit for normalized numbers bits single, bits double always true: Significand < 1 (for normalized numbers) Note: 0 has no leading 1, so reserve exponent value 0 just for number 0

IEEE 754 Floating Point Standard (2/4) Want FP numbers to be used even if no FP hardware; e.g., sort records with FP numbers using integer compares Could break FP number into 3 parts: compare signs, then compare exponents, then compare significands Then want order: Highest order bit is sign ( negative < positive) Exponent next, so big exponent => bigger # Significand last: exponents same => bigger #

IEEE 754 Floating Point Standard (3/4) Negative Exponent? 2’s comp? 1.0 x 2 -1 v. 1.0 x2 +1 (1/2 v. 2) / This notation using integer compare of 1/2 v. 2 makes 1/2 > 2! Instead, pick notation is most negative, and is most positive 1.0 x 2 -1 v. 1.0 x2 +1 (1/2 v. 2) 1/

IEEE 754 Floating Point Standard (4/4) Called Biased Notation, where bias is number subtracted to get real number IEEE 754 uses bias of 127 for single prec. Subtract 127 from Exponent field to get actual value for exponent 1023 is bias for double precision Summary (single precision): 0 31 SExponent Significand 1 bit8 bits23 bits (-1) S x (1 + Significand) x 2 (Exponent-127) Double precision identical, except with exponent bias of 1023

Floating Point numbers approximate values that we want to use. IEEE 754 Floating Point Standard is most widely accepted attempt to standardize interpretation of such numbers Every desktop or server computer sold since ~1997 follows these conventions Summary (single precision): 0 31 SExponent Significand 1 bit8 bits23 bits (-1) S x (1 + Significand) x 2 (Exponent-127) Double precision identical, bias of 1023

Example: Converting Binary FP to Decimal Sign: 0 => positive Exponent: two = 104 ten Bias adjustment: = -23 Significand: 1 + 1x x x x x = = 1.0 ten ten Represents: ten *2 -23 ~ 1.986*10 -7 (about 2/10,000,000)

Converting Decimal to FP (1/3) Simple Case: If denominator is an exponent of 2 (2, 4, 8, 16, etc.), then it’s easy. Show MIPS representation of = -3/4 -11 two /100 two = two Normalized to -1.1 two x 2 -1 (-1) S x (1 + Significand) x 2 (Exponent-127) (-1) 1 x ( ) x 2 ( )

Converting Decimal to FP (2/3) Not So Simple Case: If denominator is not an exponent of 2. Then we can’t represent number precisely, but that’s why we have so many bits in significand: for precision Once we have significand, normalizing a number to get the exponent is easy. So how do we get the significand of a neverending number?

Converting Decimal to FP (3/3) Fact: All rational numbers have a repeating pattern when written out in decimal. Fact: This still applies in binary. To finish conversion: Write out binary number with repeating pattern. Cut it off after correct number of bits (different for single v. double precision). Derive Sign, Exponent and Significand fields.

Example: Representing 1/3 in MIPS 1/3 = … 10 = … = 1/4 + 1/16 + 1/64 + 1/256 + … = … = … 2 * 2 0 = … 2 * 2 -2 Sign: 0 Exponent = = 125 = Significand = …

Representation for ± ∞ In FP, divide by 0 should produce ± ∞, not overflow. Why? OK to do further computations with ∞ E.g., X/0 > Y may be a valid comparison IEEE 754 represents ± ∞ Most positive exponent reserved for ∞ Significands all zeroes

Representation for 0 Represent 0? exponent all zeroes significand all zeroes too What about sign? +0: : Why two zeroes? Helps in some limit comparisons

Special Numbers What have we defined so far? (Single Precision) ExponentSignificandObject 000 0nonzero??? 1-254anything+/- fl. pt. # 2550+/- ∞ 255nonzero???

Representation for Not a Number What is sqrt(-4.0) or 0/0 ? If ∞ not an error, these shouldn’t be either. Called Not a Number (NaN) Exponent = 255, Significand nonzero Why is this useful? Hope NaNs help with debugging? They contaminate: op(NaN, X) = NaN

Floating Point Addition Algorithm:

Floating Point Addition Hardware:

Floating Point Multpication Algorithm:

MIPS Floating Point Architecture (1/4) Separate floating point instructions: Single Precision: add.s, sub.s, mul.s, div.s Double Precision: add.d, sub.d, mul.d, div.d These are far more complicated than their integer counterparts Can take much longer to execute

MIPS Floating Point Architecture (2/4) Problems: Inefficient to have different instructions take vastly differing amounts of time. Generally, a particular piece of data will not change FP  int within a program. -Only 1 type of instruction will be used on it. Some programs do no FP calculations It takes lots of hardware relative to integers to do FP fast

MIPS Floating Point Architecture (3/4) 1990 Solution: Make a completely separate chip that handles only FP. Coprocessor 1: FP chip contains bit registers: $f0, $f1, … most of the registers specified in.s and.d instruction refer to this set separate load and store: lwc1 and swc1 (“load word coprocessor 1”, “store …”) Double Precision: by convention, even/odd pair contain one DP FP number: $f0 / $f1, $f2 / $f3, …, $f30 / $f31 -Even register is the name

MIPS Floating Point Architecture (4/4) Today, FP coprocessor integrated with CPU, or cheap chips may leave out FP HW Instructions to move data between main processor and coprocessors: mfc0, mtc0, mfc1, mtc1, etc. Appendix contains many more FP ops

Performance evaluation

Performance Metrics  Purchasing perspective l given a collection of machines, which has the -best performance ? -least cost ? -best cost/performance?  Design perspective l faced with design options, which has the -best performance improvement ? -least cost ? -best cost/performance?  Both require l basis for comparison l metric for evaluation

Two Notions of “Performance” Plane Boeing 747 BAD/Sud Concorde Top Speed DC to ParisPassengers Throughput (pass x mph) 610 mph6.5 hours470286, mph3 hours132178,200 Which has higher performance? Time to deliver 1 passenger? Time to deliver 400 passengers? In a computer, time for 1 job called Response Time or Execution Time In a computer, jobs per day called Throughput or Bandwidth We will focus primarily on execution time for a single job

Performance Metrics  Normally we are interested in reducing Response time (aka execution time) – the time between the start and the completion of a task  Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors  Additional Metrics l Smallest/lightest l Longest battery life l Most reliable/durable (in space)

What is Time?  Straightforward definition of time: l Total time to complete a task, including disk accesses, memory accesses, I/O activities, operating system overhead,... l “real time”, “response time” or “elapsed time”  Alternative: just time processor (CPU) is working only on your program (since multiple processes running at same time) l “CPU execution time” or “CPU time” l Often divided into system CPU time (in OS) and user CPU time (in user program)

Performance Factors  Want to distinguish elapsed time and the time spent on our task  CPU execution time (CPU time) – time the CPU spends working on a task l Does not include time waiting for I/O or running other programs CPU execution time # CPU clock cycles for a program for a program = x clock cycle time CPU execution time # CPU clock cycles for a program for a program clock rate =  Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program  Many techniques that decrease the number of clock cycles also increase the clock cycle time or

Review: Machine Clock Rate  Clock rate (MHz, GHz) is inverse of clock cycle time (clock period) CC = 1 / CR one clock period 10 nsec clock cycle => 100 MHz clock rate 5 nsec clock cycle => 200 MHz clock rate 2 nsec clock cycle => 500 MHz clock rate 1 nsec clock cycle => 1 GHz clock rate 500 psec clock cycle => 2 GHz clock rate 250 psec clock cycle => 4 GHz clock rate 200 psec clock cycle => 5 GHz clock rate

Clock Cycles per Instruction  Not all instructions take the same amount of time to execute l One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction  Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute l A way to compare two different implementations of the same ISA # CPU clock cycles # Instructions Average clock cycles for a program for a program per instruction = x CPI for this instruction class ABC CPI123

Effective CPI  Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging Overall effective CPI =  (CPI i x IC i ) i = 1 n l Where IC i is the count (percentage) of the number of instructions of class i executed l CPI i is the (average) number of clock cycles per instruction for that instruction class l n is the number of instruction classes  The overall effective CPI varies by instruction mix – a measure of the dynamic frequency of instructions across one or many programs

THE Performance Equation  Our basic performance equation is then CPU time = Instruction_count x CPI x clock_cycle Instruction_count x CPI clock_rate CPU time = or  These equations separate the three key factors that affect performance l Can measure the CPU execution time by running the program l The clock rate is usually given l Can measure overall instruction count by using profilers/ simulators without knowing all of the implementation details l CPI varies by instruction type and ISA implementation for which we must know the implementation details instruction count != the number of lines in the code

Determinates of CPU Performance CPU time = Instruction_count x CPI x clock_cycle Instruction_ count CPIclock_cycle Algorithm Programming language Compiler ISA Processor organization Technology

Determinates of CPU Performance CPU time = Instruction_count x CPI x clock_cycle Instruction_ count CPIclock_cycle Algorithm Programming language Compiler ISA Processor organization Technology X XX XX XX X X X X X

A Simple Example  How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?  How does this compare with using branch prediction to shave a cycle off the branch time?  What if two ALU instructions could be executed at once? OpFreqCPI i Freq x CPI i ALU50%1. Load20%5 Store10%3 Branch20%2  =

A Simple Example  How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?  How does this compare with using branch prediction to shave a cycle off the branch time? OpFreqCPI i Freq x CPI i ALU50%1 Load20%5 Store10%3 Branch20%2  = CPU time new = 1.6 x IC x CC so 2.2/1.6 means 37.5% faster CPU time new = 2.0 x IC x CC so 2.2/2.0 means 10% faster

Comparing and Summarizing Performance  Guiding principle in reporting performance measurements is reproducibility – list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))  How do we summarize the performance for benchmark set with a single number? l The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM) AM = 1/n  Time i i = 1 n l Where Time i is the execution time for the i th program of a total of n programs in the workload l A smaller mean indicates a smaller average execution time and thus improved performance

SPEC Benchmarks Integer benchmarksFP benchmarks gzipcompressionwupwiseQuantum chromodynamics vprFPGA place & routeswimShallow water model gccGNU C compilermgridMultigrid solver in 3D fields mcfCombinatorial optimizationappluParabolic/elliptic pde craftyChess programmesa3D graphics library parserWord processing programgalgelComputational fluid dynamics eonComputer visualizationartImage recognition (NN) perlbmkperl applicationequakeSeismic wave propagation simulation gapGroup theory interpreterfacerecFacial image recognition vortexObject oriented databaseammpComputational chemistry bzip2compressionlucasPrimality testing twolfCircuit place & routefma3dCrash simulation fem sixtrackNuclear physics accel apsiPollutant distribution

Example SPEC Ratings

Other Performance Metrics  Power consumption – especially in the embedded market where battery life is important (and passive cooling) l For power-limited applications, the most important metric is energy efficiency

Summary: Evaluating ISAs  Design-time metrics: l Can it be implemented, in how long, at what cost? l Can it be programmed? Ease of compilation?  Static Metrics: l How many bytes does the program occupy in memory?  Dynamic Metrics: l How many instructions are executed? How many bytes does the processor fetch to execute the program? l How many clocks are required per instruction? Best Metric: Time to execute the program! CPI Inst. CountCycle Time depends on the instructions set, the processor organization, and compilation techniques.

Example A certain computer has a CPU running at 500 MHz. Your PINE session on this machine ends up executing 200 million instructions, of the following mix: Type CPI Freq A 4 20 % B 3 30 % C 1 40 % D 2 10 % What is the average CPI for this session?

Problem 5 - Performance A certain computer has a CPU running at 500 MHz. Your PINE session on this machine ends up executing 200 million instructions, of the following mix: Type CPI Freq A 4 20 % B 3 30 % C 1 40 % D 2 10 % What is the average CPI for this session? Answer: (20% · 4) + (30% · 3) + (40% · 1) + (10% · 2) = 2.3.

Problem 5 - Performance A certain computer has a CPU running at 500 MHz. Your PINE session on this machine ends up executing 200 million instructions, of the following mix: Type CPI Freq A 4 20 % B 3 30 % C 1 40 % D 2 10 % How much CPU time was used in this session?

Problem 5 - Performance A certain computer has a CPU running at 500 MHz. Your PINE session on this machine ends up executing 200 million instructions, of the following mix: Type CPI Freq A 4 20 % B 3 30 % C 1 40 % D 2 10 % How much CPU time was used in this session? Answer: (200 x 10 6 instructions) · (2 ns/cycle) · (2.3 cycles/instruction) = 0.92 seconds.