1 © 2004 Morgan Kaufmann Publishers
CS2810 Spring 2007
Dan Watson, dan.watson@usu.edu
Course syllabus, calendar, and assignments found at http://www.cs.usu.edu/~watson/cs2810
These overheads are based on presentations courtesy of Professor Mary Jane Irwin, Penn State University, and Professor Tod Amon, Southern Utah University.

2 Chapter 1

3 Introduction
This course is all about how computers work.
But what do we mean by a computer?
–Different types: desktop, servers, embedded devices
–Different uses: automobiles, graphics, finance, genomics…
–Different manufacturers: Intel, Apple, IBM, Microsoft, Sun…
–Different underlying technologies and different costs!
Analogy: Consider a course on “automotive vehicles”
–Many similarities from vehicle to vehicle (e.g., wheels)
–Huge differences from vehicle to vehicle (e.g., gas vs. electric)
Best way to learn:
–Focus on a specific instance and learn how it works
–While learning general principles and historical perspectives

4 Why learn this stuff?
You want to call yourself a “computer scientist”
You want to build software people use (need performance)
You need to make a purchasing decision or offer “expert” advice
Both hardware and software affect performance:
–Algorithm determines number of source-level statements
–Language/Compiler/Architecture determine machine instructions (Chapters 2 and 3)
–Processor/Memory determine how fast instructions are executed (Chapters 5, 6, and 7)
Assessing and Understanding Performance in Chapter 4

5 What is a computer?
Components:
–input (mouse, keyboard)
–output (display, printer)
–memory (disk drives, DRAM, SRAM, CD)
–network
Our primary focus: the processor (datapath and control)
–implemented using millions of transistors
–Impossible to understand by looking at each transistor
–We need...

6 Where is the Market?
[Chart; axis in millions of computers]

7 By the architecture of a system, I mean the complete and detailed specification of the user interface. … As Blaauw has said, “Where architecture tells what happens, implementation tells how it is made to happen.”
The Mythical Man-Month, Brooks, pg 45

8 Instruction Set Architecture (ISA)
ISA: An abstract interface between the hardware and the lowest level software of a machine that encompasses all the information necessary to write a machine language program that will run correctly, including instructions, registers, memory access, I/O, and so on.
“... the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.” – Amdahl, Blaauw, and Brooks, 1964
–Enables implementations of varying cost and performance to run identical software
ABI (application binary interface): The user portion of the instruction set plus the operating system interfaces used by application programmers. Defines a standard for binary portability across computers.

9 ISA Type Sales
[Bar chart; axis in millions of processors. PowerPoint “comic” bar chart with approximate values (see text for correct values)]

10 Moore’s Law
In 1965, Gordon Moore predicted that the number of transistors that can be integrated on a die would double every 18 to 24 months (i.e., grow exponentially with time).
Amazingly visionary: the million transistor/chip barrier was crossed in the 1980s.
–2300 transistors, 1 MHz clock (Intel 4004), 1971
–16 million transistors (Ultra Sparc III)
–42 million transistors, 2 GHz clock (Intel Xeon), 2001
–55 million transistors, 3 GHz, 130 nm technology, 250 mm^2 die (Intel Pentium 4), 2004
–140 million transistors (HP PA-8500)

11 Historical Perspective
ENIAC, built in World War II, was the first general purpose computer
–Used for computing artillery firing tables
–80 feet long by 8.5 feet high and several feet wide
–Each of the twenty 10 digit registers was 2 feet long
–Used 18,000 vacuum tubes
–Performed 1900 additions per second
Since then, Moore’s Law: transistor capacity doubles every 18-24 months

12 Processor Performance Increase
[Plot; labeled data points: SUN-4/260, MIPS M/120, MIPS M2000, IBM RS6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/500, DEC Alpha 21264/600, DEC Alpha 5/300, DEC Alpha 21264A/667, Intel Xeon/2000, Intel Pentium 4/3000]

13 DRAM Capacity Growth
[Plot; capacities shown: 16K, 64K, 256K, 1M, 4M, 16M, 64M, 128M, 256M, 512M]

15 Impacts of Advancing Technology
Processor
–logic capacity: increases about 30% per year
–performance: 2x every 1.5 years
Memory
–DRAM capacity: 4x every 3 years, now 2x every 2 years
–memory speed: 1.5x every 10 years
–cost per bit: decreases about 25% per year
Disk
–capacity: increases about 60% per year
ClockCycle = 1/ClockRate
  500 MHz ClockRate = 2 nsec ClockCycle
  1 GHz ClockRate = 1 nsec ClockCycle
  4 GHz ClockRate = 250 psec ClockCycle
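
The ClockCycle = 1/ClockRate relation can be checked with a few lines of Python. This is only an illustrative sketch; the function name clock_cycle_seconds is made up for this example and is not part of the course material.

```python
def clock_cycle_seconds(clock_rate_hz: float) -> float:
    """Clock cycle time in seconds for a clock rate given in hertz."""
    if clock_rate_hz <= 0:
        raise ValueError("clock rate must be positive")
    return 1.0 / clock_rate_hz


# The three clock rates listed on the slide.
for rate_hz in (500e6, 1e9, 4e9):
    cycle_ns = clock_cycle_seconds(rate_hz) * 1e9
    print(f"{rate_hz / 1e6:.0f} MHz clock rate: {cycle_ns:g} ns clock cycle")
```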

16 Example Machine Organization
Workstation design target
–25% of cost on processor
–25% of cost on memory (minimum memory size)
–Rest on I/O devices, power supplies, box
[Diagram: Computer = CPU (Control, Datapath), Memory, Devices (Input, Output)]

17 PC Motherboard Closeup

18 Inside the Pentium 4 Processor Chip

19 Example Machine Organization
TI SuperSPARC(tm) TMS390Z50 in Sun SPARCstation20
[Block diagram labels: Floating-point Unit, Integer Unit, Inst Cache, Ref MMU, Data Cache, Store Buffer, Bus Interface, SuperSPARC, L2 $, CC, MBus Module, MBus, L64852 MBus control, M-S Adapter, SBus, DRAM Controller, SBus DMA, SCSI, Ethernet, STDIO, serial, kbd, mouse, audio, RTC, Boot PROM, Floppy, SBus Cards]

20 Instruction Set Architecture
A very important abstraction
–interface between hardware and low-level software
–standardizes instructions, machine language bit patterns, etc.
–advantage: different implementations of the same architecture
–disadvantage: sometimes prevents using new innovations
True or False: Binary compatibility is extraordinarily important?
Modern instruction set architectures:
–IA-32, PowerPC, MIPS, SPARC, ARM, and others

21 Abstraction
Delving into the depths reveals more information
An abstraction omits unneeded detail, helps us cope with complexity
What are some of the details that appear in these familiar abstractions?

22 MIPS R3000 Instruction Set Architecture
Instruction Categories
–Load/Store
–Computational
–Jump and Branch
–Floating Point coprocessor
–Memory Management
–Special
Registers: R0 - R31, PC, HI, LO
3 Instruction Formats: all 32 bits wide
  OP | rs | rt | rd | sa | funct
  OP | rs | rt | immediate
  OP | jump target
Q: How many already familiar with MIPS ISA?

23 How do computers work?
Need to understand abstractions such as:
–Applications software
–Systems software
–Assembly Language
–Machine Language
–Architectural Issues: i.e., Caches, Virtual Memory, Pipelining
–Sequential logic, finite state machines
–Combinational logic, arithmetic circuits
–Boolean logic, 1s and 0s
–Transistors used to build logic gates (CMOS)
–Semiconductors/Silicon used to build transistors
–Properties of atoms, electrons, and quantum dynamics
So much to learn!

25 Instructions: Language of the Machine
We’ll be working with the MIPS instruction set architecture
–similar to other architectures developed since the 1980s
–Almost 100 million MIPS processors manufactured in 2002
–used by NEC, Nintendo, Cisco, Silicon Graphics, Sony, …

26 MIPS arithmetic
All instructions have 3 operands
Operand order is fixed (destination first)
Example:
  C code: a = b + c
  MIPS ‘code’: add a, b, c
(we’ll talk about registers in a bit)
“The natural number of operands for an operation like addition is three…requiring every instruction to have exactly three operands, no more and no less, conforms to the philosophy of keeping the hardware simple”

27 MIPS arithmetic
Design Principle: simplicity favors regularity.
Of course this complicates some things...
  C code: a = b + c + d;
  MIPS code: add a, b, c
             add a, a, d
Operands must be registers, only 32 registers provided
Each register contains 32 bits
Design Principle: smaller is faster. Why?

28 Registers vs. Memory
Arithmetic instruction operands must be registers; only 32 registers provided
Compiler associates variables with registers
What about programs with lots of variables?
[Diagram: Processor (Control, Datapath), Memory, I/O (Input, Output)]

29 Memory Organization
Viewed as a large, single-dimension array, with an address.
A memory address is an index into the array
"Byte addressing" means that the index points to a byte of memory.
[Diagram: addresses 0, 1, 2, 3, 4, 5, 6, ..., each holding 8 bits of data]

30 Memory Organization
Bytes are nice, but most data items use larger "words"
For MIPS, a word is 32 bits or 4 bytes.
2^32 bytes with byte addresses from 0 to 2^32 - 1
2^30 words with byte addresses 0, 4, 8, ..., 2^32 - 4
Words are aligned, i.e., what are the 2 least significant bits of a word address?
Registers hold 32 bits of data
[Diagram: word addresses 0, 4, 8, 12, ..., each holding 32 bits of data]
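
One way to answer the alignment question is to test the low two bits of a byte address. The sketch below is illustrative only; the helper names is_word_aligned and word_index are invented for this example.

```python
WORD_BYTES = 4  # a MIPS word is 32 bits, i.e. 4 bytes


def is_word_aligned(byte_address: int) -> bool:
    """A word address is aligned when its two least significant bits are 00."""
    return (byte_address & 0b11) == 0


def word_index(byte_address: int) -> int:
    """Map an aligned byte address to its word number (0, 1, 2, ...)."""
    if not is_word_aligned(byte_address):
        raise ValueError(f"address {byte_address:#x} is not word aligned")
    return byte_address // WORD_BYTES


# Byte addresses 0, 4, 8, 12 are words 0, 1, 2, 3.
print([word_index(a) for a in (0, 4, 8, 12)])
```

The last aligned address, 2^32 - 4, maps to word 2^30 - 1, which matches the count of 2^30 words.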

31 Instructions
Load and store instructions
Example:
  C code: A[12] = h + A[8];
  MIPS code: lw $t0, 32($s3)
             add $t0, $s2, $t0
             sw $t0, 48($s3)
Can refer to registers by name (e.g., $s2, $t2) instead of number
Store word has destination last
Remember arithmetic operands are registers, not memory!
Can’t write: add 48($s3), $s2, 32($s3)

32 Our First Example
Can we figure out the code?
  void swap(int v[], int k)
  {
    int temp;
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
  }
  swap:
    sll $2, $5, 2      # $2 = k * 4, the byte offset of v[k]
    add $2, $4, $2     # $2 = address of v[k]
    lw $15, 0($2)      # $15 = v[k]
    lw $16, 4($2)      # $16 = v[k+1]
    sw $16, 0($2)      # v[k] = old v[k+1]
    sw $15, 4($2)      # v[k+1] = old v[k]
    jr $31             # return to the caller

33 So far we’ve learned:
MIPS
–loading words but addressing bytes
–arithmetic on registers only
Instruction           Meaning
add $s1, $s2, $s3     $s1 = $s2 + $s3
sub $s1, $s2, $s3     $s1 = $s2 - $s3
lw $s1, 100($s2)      $s1 = Memory[$s2+100]
sw $s1, 100($s2)      Memory[$s2+100] = $s1

34 Machine Language
Instructions, like registers and words of data, are also 32 bits long
–Example: add $t1, $s1, $s2
–registers have numbers, $t1=9, $s1=17, $s2=18
Instruction Format:
  op     rs    rt    rd    shamt funct
  000000 10001 10010 01001 00000 100000
Can you guess what the field names stand for?
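
As a rough illustration of how the six fields pack into one 32-bit word, here is a minimal Python sketch. The function name encode_r_format is hypothetical; the field widths (6, 5, 5, 5, 5, 6 bits) and the funct value 32 for add come from the later slides in this deck.

```python
def encode_r_format(rs: int, rt: int, rd: int, shamt: int, funct: int, op: int = 0) -> int:
    """Pack R-format fields into a 32-bit word: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    for value, width in ((op, 6), (rs, 5), (rt, 5), (rd, 5), (shamt, 5), (funct, 6)):
        if not 0 <= value < (1 << width):
            raise ValueError(f"field value {value} does not fit in {width} bits")
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


# add $t1, $s1, $s2 with $t1=9, $s1=17, $s2=18 (register numbers from the slide).
T1, S1, S2 = 9, 17, 18
ADD_FUNCT = 32
word = encode_r_format(rs=S1, rt=S2, rd=T1, shamt=0, funct=ADD_FUNCT)
print(f"{word:032b}")
print(f"{word:#010x}")
```

Note that the destination register is first in the assembly text but sits in the rd field, after the two source fields, in the machine word.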

35 Machine Language
Consider the load-word and store-word instructions,
–What would the regularity principle have us do?
–New principle: Good design demands a compromise
Introduce a new type of instruction format
–I-type for data transfer instructions
–other format was R-type for register
Example: lw $t0, 32($s2)
  op   rs   rt   16 bit number
  35   18   8    32
Where's the compromise?
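
The I-type layout can be sketched the same way. This is an illustrative sketch, not the course's reference code; encode_i_format is a made-up name, and register numbers $t0=8 and $s2=18 follow the MIPS register convention table later in the deck.

```python
def encode_i_format(op: int, rs: int, rt: int, immediate: int) -> int:
    """Pack I-format fields into a 32-bit word: op(6) rs(5) rt(5) immediate(16)."""
    if not -(1 << 15) <= immediate < (1 << 15):
        raise ValueError("immediate does not fit in 16 signed bits")
    # Keep only the low 16 bits so negative offsets use two's complement.
    return (op << 26) | (rs << 21) | (rt << 16) | (immediate & 0xFFFF)


LW_OP = 35        # opcode for load word
T0, S2 = 8, 18    # register numbers of $t0 and $s2
word = encode_i_format(LW_OP, rs=S2, rt=T0, immediate=32)
print(f"{word:#010x}")
```

The compromise shows up in the fields: the instruction stays 32 bits wide, but the last three R-type fields are merged into one 16-bit number.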

36 Stored Program Concept
Instructions are bits
Programs are stored in memory, to be read or written just like data
[Diagram: Processor and Memory; memory for data, programs, compilers, editors, etc.]
Fetch & Execute Cycle
–Instructions are fetched and put into a special register
–Bits in the register "control" the subsequent actions
–Fetch the “next” instruction and continue

37 Control
Decision making instructions
–alter the control flow,
–i.e., change the "next" instruction to be executed
MIPS conditional branch instructions:
  bne $t0, $t1, Label
  beq $t0, $t1, Label
Example: if (i==j) h = i + j;
  bne $s0, $s1, Label
  add $s3, $s0, $s1
  Label: ....

38 Control
MIPS unconditional branch instructions:
  j label
Example:
  if (i!=j)            beq $s4, $s5, Lab1
    h=i+j;             add $s3, $s4, $s5
  else                 j Lab2
    h=i-j;       Lab1: sub $s3, $s4, $s5
                 Lab2: ...
Can you build a simple for loop?

39 So far:
Instruction           Meaning
add $s1,$s2,$s3       $s1 = $s2 + $s3
sub $s1,$s2,$s3       $s1 = $s2 - $s3
lw $s1,100($s2)       $s1 = Memory[$s2+100]
sw $s1,100($s2)       Memory[$s2+100] = $s1
bne $s4,$s5,L         Next instr. is at Label if $s4 ≠ $s5
beq $s4,$s5,L         Next instr. is at Label if $s4 = $s5
j Label               Next instr. is at Label
Formats:
  R: op | rs | rt | rd | shamt | funct
  I: op | rs | rt | 16 bit address
  J: op | 26 bit address

40 Control Flow
We have: beq, bne, what about Branch-if-less-than?
New instruction:
  slt $t0, $s1, $s2     if $s1 < $s2 then $t0 = 1 else $t0 = 0
Can use this instruction to build "blt $s1, $s2, Label"
–can now build general control structures
Note that the assembler needs a register to do this,
–there are policy of use conventions for registers

41 Policy of Use Conventions
Register 1 ($at) reserved for assembler, 26-27 for operating system

42 Constants
Small constants are used quite frequently (50% of operands)
  e.g., A = A + 5; B = B + 1; C = C - 18;
Solutions? Why not?
–put 'typical constants' in memory and load them.
–create hard-wired registers (like $zero) for constants like one.
MIPS Instructions:
  addi $29, $29, 4
  slti $8, $18, 10
  andi $29, $29, 6
  ori $29, $29, 4
Design Principle: Make the common case fast. Which format?

43 How about larger constants?
We'd like to be able to load a 32 bit constant into a register
Must use two instructions, new "load upper immediate" instruction
  lui $t0, 1010101010101010
  $t0 = 1010101010101010 0000000000000000   (lower 16 bits filled with zeros)
Then must get the lower order bits right, i.e.,
  ori $t0, $t0, 1010101010101010
        1010101010101010 0000000000000000
  ori   0000000000000000 1010101010101010
      = 1010101010101010 1010101010101010
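
The two-instruction sequence can be mimicked in Python to see why it works. This is a sketch under simple assumptions: lui and ori here are plain functions modelling only the register result, and load_32bit_constant is a made-up helper name.

```python
MASK32 = 0xFFFFFFFF


def lui(immediate16: int) -> int:
    """Load upper immediate: place 16 bits in the upper half, zeros in the lower half."""
    return ((immediate16 & 0xFFFF) << 16) & MASK32


def ori(register_value: int, immediate16: int) -> int:
    """OR immediate: the 16-bit immediate is zero-extended before the OR."""
    return (register_value | (immediate16 & 0xFFFF)) & MASK32


def load_32bit_constant(value: int) -> int:
    """Build a 32-bit constant the way the lui/ori pair does."""
    upper = (value >> 16) & 0xFFFF
    lower = value & 0xFFFF
    t0 = lui(upper)       # upper half set, lower half zero
    t0 = ori(t0, lower)   # fill in the lower half
    return t0


print(f"{load_32bit_constant(0xAAAAAAAA):032b}")
```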

44 Assembly Language vs. Machine Language
Assembly provides convenient symbolic representation
–much easier than writing down numbers
–e.g., destination first
Machine language is the underlying reality
–e.g., destination is no longer first
Assembly can provide 'pseudoinstructions'
–e.g., “move $t0, $t1” exists only in Assembly
–would be implemented using “add $t0,$t1,$zero”
When considering performance you should count real instructions

45 Other Issues
Discussed in your assembly language programming lab:
–support for procedures
–linkers, loaders, memory layout
–stacks, frames, recursion
–manipulating strings and pointers
–interrupts and exceptions
–system calls and conventions
Some of these we'll talk more about later
We’ll talk about compiler optimizations when we hit chapter 4.

46 Overview of MIPS
simple instructions all 32 bits wide
very structured, no unnecessary baggage
only three instruction formats
  R: op | rs | rt | rd | shamt | funct
  I: op | rs | rt | 16 bit address
  J: op | 26 bit address
rely on compiler to achieve performance
–what are the compiler's goals?
help compiler where we can

47 Addresses in Branches and Jumps
Instructions:
  bne $t4,$t5,Label     Next instruction is at Label if $t4 ≠ $t5
  beq $t4,$t5,Label     Next instruction is at Label if $t4 = $t5
  j Label               Next instruction is at Label
Formats:
  I: op | rs | rt | 16 bit address
  J: op | 26 bit address
Addresses are not 32 bits
–How do we handle this with load and store instructions?

48 Addresses in Branches
Instructions:
  bne $t4,$t5,Label     Next instruction is at Label if $t4 ≠ $t5
  beq $t4,$t5,Label     Next instruction is at Label if $t4 = $t5
Formats:
  I: op | rs | rt | 16 bit address
Could specify a register (like lw and sw) and add it to address
–use Instruction Address Register (PC = program counter)
–most branches are local (principle of locality)
Jump instructions just use high order bits of PC
–address boundaries of 256 MB

49 To summarize:

51 CSE 431 Computer Architecture, Fall 2005
Lecture 02: MIPS ISA Review
Mary Jane Irwin (www.cse.psu.edu/~mji)
www.cse.psu.edu/~cg431
[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, UCB]

52 (vonNeumann) Processor Organization
Control needs to
1. input instructions from Memory
2. issue signals to control the information flow between the Datapath components and to control what operations they perform
3. control instruction sequencing
Datapath needs to have the
–components: the functional units and storage (e.g., register file) needed to execute instructions
–interconnects: components connected so that the instructions can be accomplished and so that data can be loaded from and stored to Memory
[Diagram: CPU (Control, Datapath), Memory, Devices (Input, Output); cycle: Fetch, Decode, Exec]

53 RISC - Reduced Instruction Set Computer
RISC philosophy
–fixed instruction lengths
–load-store instruction sets
–limited addressing modes
–limited operations
MIPS, Sun SPARC, HP PA-RISC, IBM PowerPC, Intel (Compaq) Alpha, …
Instruction sets are measured by how well compilers use them as opposed to how well assembly language programmers use them
Design goals: speed, cost (design, fabrication, test, packaging), size, power consumption, reliability, memory space (embedded systems)

54 MIPS R3000 Instruction Set Architecture (ISA)
Instruction Categories
–Computational
–Load/Store
–Jump and Branch
–Floating Point coprocessor
–Memory Management
–Special
Registers: R0 - R31, PC, HI, LO
3 Instruction Formats: all 32 bits wide
  R format: OP | rs | rt | rd | sa | funct
  I format: OP | rs | rt | immediate
  J format: OP | jump target

55 Review: Unsigned Binary Representation
Hex          Binary   Decimal
0x00000000   0…0000   0
0x00000001   0…0001   1
0x00000002   0…0010   2
0x00000003   0…0011   3
0x00000004   0…0100   4
0x00000005   0…0101   5
0x00000006   0…0110   6
0x00000007   0…0111   7
0x00000008   0…1000   8
0x00000009   0…1001   9
…
0xFFFFFFFC   1…1100   2^32 - 4
0xFFFFFFFD   1…1101   2^32 - 3
0xFFFFFFFE   1…1110   2^32 - 2
0xFFFFFFFF   1…1111   2^32 - 1
bit position:  31    30    29   ...  3    2    1    0
bit weight:    2^31  2^30  2^29 ...  2^3  2^2  2^1  2^0
All ones: 1 1 1 1 ... 1 1 1 1 = (1 0 0 0 ... 0 0 0 0) - 1 = 2^32 - 1

56 Aside: Beyond Numbers
American Std Code for Info Interchange (ASCII): 8-bit bytes representing characters
ASCII Char    ASCII Char    ASCII Char    ASCII Char    ASCII Char    ASCII Char
0     Null    32    space   48    0       64    @       96    `       112   p
1             33    !       49    1       65    A       97    a       113   q
2             34    "       50    2       66    B       98    b       114   r
3             35    #       51    3       67    C       99    c       115   s
4     EOT     36    $       52    4       68    D       100   d       116   t
5             37    %       53    5       69    E       101   e       117   u
6     ACK     38    &       54    6       70    F       102   f       118   v
7             39    '       55    7       71    G       103   g       119   w
8     bksp    40    (       56    8       72    H       104   h       120   x
9     tab     41    )       57    9       73    I       105   i       121   y
10    LF      42    *       58    :       74    J       106   j       122   z
11            43    +       59    ;       75    K       107   k       123   {
12    FF      44    ,       60    <       76    L       108   l       124   |
15            47    /       63    ?       79    O       111   o       127   DEL

57 MIPS Arithmetic Instructions
MIPS assembly language arithmetic statement
  add $t0, $s1, $s2
  sub $t0, $s1, $s2
Each arithmetic instruction performs only one operation
Each arithmetic instruction fits in 32 bits and specifies exactly three operands
  destination ← source1 op source2
Those operands are all contained in the datapath’s register file ( $t0,$s1,$s2 ), indicated by $
Operand order is fixed (destination first)

59 Aside: MIPS Register Convention
Name         Register Number   Usage                    Preserve on call?
$zero        0                 constant 0 (hardware)    n.a.
$at          1                 reserved for assembler   n.a.
$v0 - $v1    2-3               returned values          no
$a0 - $a3    4-7               arguments                yes
$t0 - $t7    8-15              temporaries              no
$s0 - $s7    16-23             saved values             yes
$t8 - $t9    24-25             temporaries              no
$gp          28                global pointer           yes
$sp          29                stack pointer            yes
$fp          30                frame pointer            yes
$ra          31                return addr (hardware)   yes

60 MIPS Register File
Holds thirty-two 32-bit registers
–Two read ports and
–One write port
Registers are
–Faster than main memory
   But register files with more locations are slower (e.g., a 64 word file could be as much as 50% slower than a 32 word file)
   Read/write port increase impacts speed quadratically
–Easier for a compiler to use
   e.g., (A*B) – (C*D) – (E*F) can do multiplies in any order vs. stack
–Can hold variables so that code density improves (since registers are named with fewer bits than a memory location)
[Diagram: Register File with 32 locations of 32 bits; 5-bit src1 addr, src2 addr, and dst addr inputs; 32-bit write data input; write control; 32-bit src1 data and src2 data outputs]

61 Machine Language - Add Instruction
Instructions, like registers and words of data, are 32 bits long
Arithmetic Instruction Format (R format): add $t0, $s1, $s2
  op | rs | rt | rd | shamt | funct
Field   Width    Meaning
op      6 bits   opcode that specifies the operation
rs      5 bits   register file address of the first source operand
rt      5 bits   register file address of the second source operand
rd      5 bits   register file address of the result’s destination
shamt   5 bits   shift amount (for shift instructions)
funct   6 bits   function code augmenting the opcode

62 MIPS Memory Access Instructions
MIPS has two basic data transfer instructions for accessing memory
  lw $t0, 4($s3)   #load word from memory
  sw $t0, 8($s3)   #store word to memory
The data is loaded into (lw) or stored from (sw) a register in the register file, selected by a 5 bit address
The memory address, a 32 bit address, is formed by adding the contents of the base address register to the offset value
–A 16-bit field meaning access is limited to memory locations within a region of ±2^13 or 8,192 words (±2^15 or 32,768 bytes) of the address in the base register
–Note that the offset can be positive or negative

63 Machine Language - Load Instruction
Load/Store Instruction Format (I format): lw $t0, 24($s2)
  op | rs | rt | 16 bit offset
Address calculation with $s2 = 0x12004094:
  24 (decimal) + $s2
      ... 0001 1000
    + ... 1001 0100
    = ... 1010 1100 = 0x120040ac
[Diagram: Memory holding data at word addresses (hex) 0x00000000, 0x00000004, 0x00000008, 0x0000000c, ..., 0xffffffff; the word at 0x120040ac is loaded into $t0]
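
The address arithmetic on this slide can be reproduced in a few lines. This is an illustrative sketch; sign_extend_16 and effective_address are names chosen for the example, and the offset is treated as a signed 16-bit value as the previous slide describes.

```python
def sign_extend_16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(base: int, offset16: int) -> int:
    """Base register contents plus the sign-extended 16-bit offset, wrapped to 32 bits."""
    return (base + sign_extend_16(offset16)) & 0xFFFFFFFF


# lw $t0, 24($s2) with $s2 = 0x12004094, as on the slide.
print(f"{effective_address(0x12004094, 24):#010x}")
```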

64 Byte Addresses
Since 8-bit bytes are so useful, most architectures address individual bytes in memory
–The memory address of a word must be a multiple of 4 (alignment restriction)
Big Endian: leftmost byte is word address
  IBM 360/370, Motorola 68k, MIPS, Sparc, HP PA
Little Endian: rightmost byte is word address
  Intel 80x86, DEC Vax, DEC Alpha (Windows NT)
Byte numbering within a word (msb on the left, lsb on the right):
  little endian: 3 2 1 0   (byte 0 is the rightmost byte)
  big endian:    0 1 2 3   (byte 0 is the leftmost byte)
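
To make the two byte orders concrete, the sketch below lists the bytes of one 32-bit word in increasing address order using Python's built-in int.to_bytes. The helper name bytes_in_memory and the sample value 0x12345678 are chosen only for illustration.

```python
def bytes_in_memory(word: int, byteorder: str) -> list:
    """Return the 4 bytes of a 32-bit word in increasing address order."""
    return list(word.to_bytes(4, byteorder=byteorder))


word = 0x12345678
# Big endian: the most significant (leftmost) byte sits at the word address.
print([f"{b:#04x}" for b in bytes_in_memory(word, "big")])
# Little endian: the least significant (rightmost) byte sits at the word address.
print([f"{b:#04x}" for b in bytes_in_memory(word, "little")])
```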

65 Aside: Loading and Storing Bytes
MIPS provides special instructions to move bytes
  lb $t0, 1($s3)   #load byte from memory
  sb $t0, 6($s3)   #store byte to memory
  op | rs | rt | 16 bit offset
What 8 bits get loaded and stored?
–load byte places the byte from memory in the rightmost 8 bits of the destination register
   what happens to the other bits in the register?
–store byte takes the byte from the rightmost 8 bits of a register and writes it to a byte in memory
   what happens to the other bits in the memory word?

66 MIPS Control Flow Instructions
MIPS conditional branch instructions:
  bne $s0, $s1, Lbl   #go to Lbl if $s0≠$s1
  beq $s0, $s1, Lbl   #go to Lbl if $s0=$s1
–Ex: if (i==j) h = i + j;
  bne $s0, $s1, Lbl1
  add $s3, $s0, $s1
  Lbl1: ...
Instruction Format (I format):
  op | rs | rt | 16 bit offset
How is the branch destination address specified?

67 Specifying Branch Destinations
Use a register (like in lw and sw) added to the 16-bit offset
–which register? Instruction Address Register (the PC)
   its use is automatically implied by instruction
   PC gets updated (PC+4) during the fetch cycle so that it holds the address of the next instruction
–limits the branch distance to -2^15 to +2^15 - 1 instructions from the (instruction after the) branch instruction, but most branches are local anyway
[Diagram: the low order 16 bits of the branch instruction (the offset) are sign-extended to 32 bits with 00 appended, then added to PC+4 to form the branch dst address]
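
The branch destination calculation described on this slide can be sketched as follows. The function name branch_target and the sample addresses are assumptions made for illustration; the arithmetic follows the slide: sign-extend the 16-bit offset, append 00 (multiply by 4), and add to PC+4.

```python
def branch_target(branch_pc: int, offset16: int) -> int:
    """Destination of a taken branch located at byte address branch_pc."""
    offset = offset16 & 0xFFFF
    if offset & 0x8000:            # sign-extend the 16-bit field
        offset -= 0x10000
    byte_offset = offset * 4       # appending 00 turns a word offset into a byte offset
    return (branch_pc + 4 + byte_offset) & 0xFFFFFFFF


# A branch at 0x00400000 with offset 3 skips ahead to the third instruction after PC+4.
print(f"{branch_target(0x00400000, 3):#010x}")
```

An offset of -1 (0xFFFF) lands back on the branch itself, because the offset is counted from the instruction after the branch.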

68 More Branch Instructions
We have beq, bne, but what about other kinds of branches (e.g., branch-if-less-than)? For this, we need yet another instruction, slt
Set on less than instruction:
  slt $t0, $s0, $s1    # if $s0 < $s1 then $t0 = 1
                       # else $t0 = 0
Instruction format (R format):
  op | rs | rt | rd | funct

69 More Branch Instructions, Cont’d
Can use slt, beq, bne, and the fixed value of 0 in register $zero to create other conditions
–less than                   blt $s1, $s2, Label
     slt $at, $s1, $s2       #$at set to 1 if
     bne $at, $zero, Label   # $s1 < $s2
–less than or equal to       ble $s1, $s2, Label
–greater than                bgt $s1, $s2, Label
–greater than or equal to    bge $s1, $s2, Label
Such branches are included in the instruction set as pseudo instructions, recognized (and expanded) by the assembler
–It's why the assembler needs a reserved register ( $at )
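
The slide only spells out the blt expansion. The sketch below shows one possible way an assembler could expand all four pseudo branches using slt, beq, bne, $zero, and $at. The function expand_pseudo_branch is hypothetical, and the ble, bgt, and bge expansions are an assumption about one workable choice, not a statement of what any particular assembler emits.

```python
def expand_pseudo_branch(mnemonic: str, rs: str, rt: str, label: str) -> list:
    """Expand blt/ble/bgt/bge into an slt plus a beq or bne on $at."""
    # For each pseudo branch: which operand order slt uses, and which real branch follows.
    expansions = {
        "blt": ((rs, rt), "bne"),   # rs < rt   -> $at = 1, branch if $at != 0
        "bge": ((rs, rt), "beq"),   # rs >= rt  -> $at = 0, branch if $at == 0
        "bgt": ((rt, rs), "bne"),   # rs > rt   is rt < rs
        "ble": ((rt, rs), "beq"),   # rs <= rt  is not (rt < rs)
    }
    if mnemonic not in expansions:
        raise ValueError(f"not a supported pseudo branch: {mnemonic}")
    (first, second), branch = expansions[mnemonic]
    return [f"slt $at, {first}, {second}", f"{branch} $at, $zero, {label}"]


for line in expand_pseudo_branch("blt", "$s1", "$s2", "Label"):
    print(line)
```

Each pseudo branch costs two real instructions, which matters when counting instructions for performance.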

70 Other Control Flow Instructions
MIPS also has an unconditional branch instruction or jump instruction:
  j label   #go to label
Instruction Format (J Format):
  op | 26-bit address
[Diagram: the 32-bit jump address is formed from the upper 4 bits of the PC, the low order 26 bits of the jump instruction, and 00]

71 Aside: Branching Far Away
What if the branch destination is further away than can be captured in 16 bits?
The assembler comes to the rescue: it inserts an unconditional jump to the branch target and inverts the condition
  beq $s0, $s1, L1
becomes
  bne $s0, $s1, L2
  j L1
  L2:

72 Instructions for Accessing Procedures
MIPS procedure call instruction:
  jal ProcedureAddress   #jump and link
Saves PC+4 in register $ra to have a link to the next instruction for the procedure return
Machine format (J format):
  op | 26 bit address
Then can do procedure return with a
  jr $ra   #return
Instruction format (R format):
  op | rs | funct

73 Aside: Spilling Registers
What if the callee needs more registers? What if the procedure is recursive?
–uses a stack (a last-in-first-out queue) in memory for passing additional values or saving (recursive) return address(es)
One of the general registers, $sp, is used to address the stack (which “grows” from high address to low address)
–add data onto the stack (push)
   $sp = $sp – 4
   data on stack at new $sp
–remove data from the stack (pop)
   data from stack at $sp
   $sp = $sp + 4
[Diagram: memory from low addr to high addr; $sp points to the top of stack]
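
The push and pop steps can be modelled with a small Python class. This is a toy sketch: the class name Stack, the dictionary standing in for memory, and the starting address 0x1000 are all assumptions made for the example.

```python
class Stack:
    """Toy model of a stack of words that grows from high addresses to low addresses."""

    def __init__(self, top: int = 0x1000):
        self.sp = top       # stands in for $sp; the start address is arbitrary
        self.memory = {}    # word storage keyed by byte address

    def push(self, value: int) -> None:
        self.sp -= 4                    # $sp = $sp - 4
        self.memory[self.sp] = value    # data on stack at new $sp

    def pop(self) -> int:
        value = self.memory.pop(self.sp)   # data from stack at $sp
        self.sp += 4                       # $sp = $sp + 4
        return value


stack = Stack()
stack.push(7)
stack.push(9)
print(hex(stack.sp), stack.pop(), stack.pop(), hex(stack.sp))
```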

74 MIPS Immediate Instructions
Small constants are used often in typical code
Possible approaches?
–put “typical constants” in memory and load them
–create hard-wired registers (like $zero) for constants like 1
–have special instructions that contain constants!
  addi $sp, $sp, 4    #$sp = $sp + 4
  slti $t0, $s2, 15   #$t0 = 1 if $s2 < 15
Machine format (I format):
  op | rs | rt | 16 bit immediate
The constant is kept inside the instruction itself!
–Immediate format limits values to the range +2^15 - 1 to -2^15

75 Aside: How About Larger Constants?
We'd also like to be able to load a 32 bit constant into a register; for this we must use two instructions
A new "load upper immediate" instruction
  lui $t0, 1010101010101010
  op | rs | rt | 16 bit immediate
  15 | 0  | 8  | 1010101010101010
Then must get the lower order bits right, use
  ori $t0, $t0, 1010101010101010
        1010101010101010 0000000000000000
  ori   0000000000000000 1010101010101010
      = 1010101010101010 1010101010101010

76 MIPS Organization So Far
[Diagram: Processor containing the Register File (32 registers, $zero - $ra, 32 bits each; 5-bit src1 addr, src2 addr, dst addr; write data; src1 data, src2 data), the ALU, and the PC with adders for PC = PC+4 and the branch offset; Memory of 2^30 words of 32 bits (read/write addr, read data, write data) at word addresses (binary) 0…0000, 0…0100, 0…1000, 0…1100, ..., 1…1100, with big Endian byte addresses; cycle: Fetch (PC = PC+4), Decode, Exec]

77 MIPS ISA So Far
Category                       Instr                        Op Code    Example              Meaning
Arithmetic (R & I format)      add                          0 and 32   add $s1, $s2, $s3    $s1 = $s2 + $s3
                               subtract                     0 and 34   sub $s1, $s2, $s3    $s1 = $s2 - $s3
                               add immediate                8          addi $s1, $s2, 6     $s1 = $s2 + 6
                               or immediate                 13         ori $s1, $s2, 6      $s1 = $s2 | 6
Data Transfer (I format)       load word                    35         lw $s1, 24($s2)      $s1 = Memory($s2+24)
                               store word                   43         sw $s1, 24($s2)      Memory($s2+24) = $s1
                               load byte                    32         lb $s1, 25($s2)      $s1 = Memory($s2+25)
                               store byte                   40         sb $s1, 25($s2)      Memory($s2+25) = $s1
                               load upper imm               15         lui $s1, 6           $s1 = 6 * 2^16
Cond. Branch (I & R format)    br on equal                  4          beq $s1, $s2, L      if ($s1==$s2) go to L
                               br on not equal              5          bne $s1, $s2, L      if ($s1 !=$s2) go to L
                               set on less than             0 and 42   slt $s1, $s2, $s3    if ($s2 < $s3) $s1=1 else $s1=0
                               set on less than immediate   10         slti $s1, $s2, 6     if ($s2 < 6) $s1=1 else $s1=0
Uncond. Jump (J & R format)    jump                         2          j 2500               go to 10000
                               jump register                0 and 8    jr $t1               go to $t1
                               jump and link                3          jal 2500             go to 10000; $ra=PC+4

78 Review of MIPS Operand Addressing Modes
Register addressing: operand is in a register
  op | rs | rt | rd | funct   (a register field selects the word operand in the register file)
Base (displacement) addressing: operand is at the memory location whose address is the sum of a register and a 16-bit constant contained within the instruction
  op | rs | rt | offset   (base register plus offset selects a word or byte operand in Memory)
–Register relative (indirect) with 0($a0)
–Pseudo-direct with addr($zero)
Immediate addressing: operand is a 16-bit constant contained within the instruction
  op | rs | rt | operand

79 Review of MIPS Instruction Addressing Modes
PC-relative addressing
–instruction address is the sum of the PC and a 16-bit constant contained within the instruction
  op | rs | rt | offset   (PC plus offset selects the branch destination instruction in Memory)
Pseudo-direct addressing
–instruction address is the 26-bit constant contained within the instruction concatenated with the upper 4 bits of the PC
  op | jump address   (upper PC bits || jump address selects the jump destination instruction in Memory)
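
Pseudo-direct addressing can be sketched in Python as below. The name jump_target is made up for the example. The sketch assumes, as the earlier slides state, that the PC has already been updated to PC+4 when its upper 4 bits are used, and that 00 is appended to the 26-bit field.

```python
def jump_target(jump_pc: int, address26: int) -> int:
    """Destination of a j instruction located at byte address jump_pc."""
    upper4 = (jump_pc + 4) & 0xF0000000          # upper 4 bits of the updated PC
    return upper4 | ((address26 & 0x03FFFFFF) << 2)   # 26-bit field followed by 00


# "j 2500" jumps to byte address 10000, as in the MIPS ISA So Far table.
print(jump_target(0x00400000, 2500))
```

Because only 28 bits come from the instruction, a jump stays inside the current 256 MB region selected by the upper 4 bits of the PC.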

80 MIPS (RISC) Design Principles
Simplicity favors regularity
–fixed size instructions: 32 bits
–small number of instruction formats
–opcode always the first 6 bits
Good design demands good compromises
–three instruction formats
Smaller is faster
–limited instruction set
–limited number of registers in register file
–limited number of addressing modes
Make the common case fast
–arithmetic operands from the register file (load-store machine)
–allow instructions to contain immediate operands

81 Chapter Three

82 Numbers
Bits are just bits (no inherent meaning)
–conventions define relationship between bits and numbers
Binary numbers (base 2)
  0000 0001 0010 0011 0100 0101 0110 0111 1000 1001...
  decimal: 0...2^n - 1
Of course it gets more complicated:
–numbers are finite (overflow)
–fractions and real numbers
–negative numbers
   e.g., no MIPS subi instruction; addi can add a negative number
How do we represent negative numbers?
  i.e., which bit patterns will represent which numbers?

83 Possible Representations
Sign Magnitude   One's Complement   Two's Complement
000 = +0         000 = +0           000 = +0
001 = +1         001 = +1           001 = +1
010 = +2         010 = +2           010 = +2
011 = +3         011 = +3           011 = +3
100 = -0         100 = -3           100 = -4
101 = -1         101 = -2           101 = -3
110 = -2         110 = -1           110 = -2
111 = -3         111 = -0           111 = -1
Issues: balance, number of zeros, ease of operations
Which one is best? Why?

84  2004 Morgan Kaufmann Publishers MIPS
32 bit signed numbers:
0000 0000 0000 0000 0000 0000 0000 0000 two = 0 ten
0000 0000 0000 0000 0000 0000 0000 0001 two = + 1 ten
0000 0000 0000 0000 0000 0000 0000 0010 two = + 2 ten
...
0111 1111 1111 1111 1111 1111 1111 1110 two = + 2,147,483,646 ten
0111 1111 1111 1111 1111 1111 1111 1111 two = + 2,147,483,647 ten (maxint)
1000 0000 0000 0000 0000 0000 0000 0000 two = – 2,147,483,648 ten (minint)
1000 0000 0000 0000 0000 0000 0000 0001 two = – 2,147,483,647 ten
1000 0000 0000 0000 0000 0000 0000 0010 two = – 2,147,483,646 ten
...
1111 1111 1111 1111 1111 1111 1111 1101 two = – 3 ten
1111 1111 1111 1111 1111 1111 1111 1110 two = – 2 ten
1111 1111 1111 1111 1111 1111 1111 1111 two = – 1 ten

85  2004 Morgan Kaufmann Publishers MIPS Number Representations
32-bit signed numbers (2’s complement), MSB on the left and LSB on the right:
0000 0000 0000 0000 0000 0000 0000 0000 two = 0 ten
0000 0000 0000 0000 0000 0000 0000 0001 two = + 1 ten
...
0111 1111 1111 1111 1111 1111 1111 1110 two = + 2,147,483,646 ten
0111 1111 1111 1111 1111 1111 1111 1111 two = + 2,147,483,647 ten (maxint)
1000 0000 0000 0000 0000 0000 0000 0000 two = – 2,147,483,648 ten (minint)
1000 0000 0000 0000 0000 0000 0000 0001 two = – 2,147,483,647 ten
...
1111 1111 1111 1111 1111 1111 1111 1110 two = – 2 ten
1111 1111 1111 1111 1111 1111 1111 1111 two = – 1 ten
Converting <32-bit values into 32-bit values
–copy the most significant bit (the sign bit) into the “empty” bits
  0010 -> 0000 0010
  1010 -> 1111 1010
–sign extend versus zero extend ( lb vs. lbu )
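As a rough sketch of the conversion rule on this slide, the following Python models sign extension (what lb does to a loaded byte) and zero extension (what lbu does). The function names and the from_bits/to_bits parameters are assumptions made for illustration.

```python
def sign_extend(value, from_bits, to_bits=32):
    """Copy the sign bit of a from_bits-wide value into the upper bits."""
    value &= (1 << from_bits) - 1
    if value >> (from_bits - 1):                       # sign bit is 1
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value


def zero_extend(value, from_bits):
    """Fill the upper bits with zeros."""
    return value & ((1 << from_bits) - 1)


print(f"{sign_extend(0b0010, 4, 8):08b}")   # the slide's 0010 example
print(f"{sign_extend(0b1010, 4, 8):08b}")   # the slide's 1010 example
print(hex(sign_extend(0xFF, 8)), hex(zero_extend(0xFF, 8)))  # lb vs. lbu on byte 0xFF
```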

86  2004 Morgan Kaufmann Publishers MIPS Arithmetic Logic Unit (ALU)
Must support the Arithmetic/Logic operations of the ISA
add, addi, addiu, addu
sub, subu, neg
mult, multu, div, divu
sqrt
and, andi, nor, or, ori, xor, xori
beq, bne, slt, slti, sltiu, sltu
[Figure: ALU with 32-bit inputs A and B, a 4-bit operation select m, a 32-bit result, and 1-bit zero and ovf outputs]
With special handling for
–sign extend – addi, addiu, slti, sltiu
–zero extend – andi, ori, xori, lbu
–no overflow detected – addu, addiu, subu, multu, divu, sltiu, sltu

87  2004 Morgan Kaufmann Publishers Negating a two's complement number: invert all bits and add 1 –remember: “negate” and “invert” are quite different! Converting n bit numbers into numbers with more than n bits: –MIPS 16 bit immediate gets converted to 32 bits for arithmetic –copy the most significant bit (the sign bit) into the other bits 0010 -> 0000 0010 1010 -> 1111 1010 –"sign extension" (lbu vs. lb) Two's Complement Operations

88  2004 Morgan Kaufmann Publishers Review: 2’s Complement Binary Representation
2’sc binary   decimal
1000          -8     (-2^3)
1001          -7     (-(2^3 - 1))
1010          -6
1011          -5
1100          -4
1101          -3
1110          -2
1111          -1
0000           0
0001           1
0010           2
0011           3
0100           4
0101           5
0110           6
0111           7     (2^3 - 1)
Negate: complement all the bits and add a 1, e.g., 0101 (5) -> complement all the bits -> 1010 -> add a 1 -> 1011 (-5)
Note: negate and invert are different!
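The negate rule (complement all the bits, then add 1) is easy to check mechanically. The sketch below works on n-bit patterns held in Python integers; the name negate and the default width of 4 bits are choices made for this example.

```python
def negate(x, n=4):
    """Two's complement negation of an n-bit pattern: invert, then add 1."""
    mask = (1 << n) - 1
    inverted = ~x & mask          # "invert": flip every bit
    return (inverted + 1) & mask  # "negate": invert, then add 1


for value in (0b0101, 0b1011, 0b0000, 0b1000):
    print(f"{value:04b} -> {negate(value):04b}")
```

The last case shows the asymmetry of the representation: negating 1000 (-8) gives 1000 again, because +8 does not fit in 4 bits.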

89  2004 Morgan Kaufmann Publishers Review: A Full Adder
1-bit Full Adder: inputs A, B, carry_in; outputs S, carry_out
A  B  carry_in  |  carry_out  S
0  0  0         |  0          0
0  0  1         |  0          1
0  1  0         |  0          1
0  1  1         |  1          0
1  0  0         |  0          1
1  0  1         |  1          0
1  1  0         |  1          0
1  1  1         |  1          1
S = A xor B xor carry_in (odd parity function)
carry_out = A&B | A&carry_in | B&carry_in (majority function)
How can we use it to build a 32-bit adder?
How can we modify it easily to build an adder/subtractor?

90  2004 Morgan Kaufmann Publishers Addition & Subtraction
Just like in grade school (carry/borrow 1s)
  0111      0111      0110
+ 0110    - 0110    - 0101
Two's complement operations easy
–subtraction using addition of negative numbers
  0111
+ 1010
Overflow (result too large for finite computer word):
–e.g., adding two n-bit numbers does not yield an n-bit number
  0111
+ 0001
  1000
note that overflow term is somewhat misleading, it does not mean a carry “overflowed”

92  2004 Morgan Kaufmann Publishers A 32-bit Ripple Carry Adder/Subtractor
Remember 2’s complement is just: complement all the bits, then add a 1 in the least significant bit
[Figure: 32 1-bit full adders (FA) in a chain; stage i takes A_i, B_i and carry c_i and produces S_i and c_(i+1), with c_0 = carry_in and c_32 = carry_out. An add/sub control line (0=add, 1=sub) passes B_i to the adder if control = 0 and !B_i if control = 1]
Example: A - B = 0111 - 0110 becomes 0111 + 1001 + 1 = 0001
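A bit-level Python model of this adder/subtractor might look like the sketch below. It assumes the control line both inverts each B bit and supplies the added 1 as c_0, which is how the worked example on the slide behaves; the names full_adder and add_sub are made up for illustration.

```python
def full_adder(a, b, carry_in):
    """1-bit full adder: odd parity for the sum, majority for the carry."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)
    return s, carry_out


def add_sub(a, b, control, n=32):
    """Ripple carry adder/subtractor; control = 0 adds, control = 1 subtracts."""
    carry = control                        # c_0: the "add a 1" of 2's complement
    result = 0
    for i in range(n):
        a_i = (a >> i) & 1
        b_i = ((b >> i) & 1) ^ control     # B_i if control = 0, !B_i if control = 1
        s, carry = full_adder(a_i, b_i, carry)
        result |= s << i
    return result, carry                   # carry is c_n = carry_out


result, carry_out = add_sub(0b0111, 0b0110, control=1, n=4)
print(f"{result:04b}", carry_out)          # the slide's 0111 - 0110 example
```

The loop makes the cost of the design visible: each stage has to wait for the carry from the stage before it, so the delay grows with the word width.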

93  2004 Morgan Kaufmann Publishers No overflow when adding a positive and a negative number No overflow when signs are the same for subtraction Overflow occurs when the value affects the sign: –overflow when adding two positives yields a negative –or, adding two negatives gives a positive –or, subtract a negative from a positive and get a negative –or, subtract a positive from a negative and get a positive Consider the operations A + B, and A – B –Can overflow occur if B is 0 ? –Can overflow occur if A is 0 ? Detecting Overflow

95  2004 Morgan Kaufmann Publishers Overflow Detection
Overflow: the result is too large to represent in 32 bits
Overflow occurs when
–adding two positives yields a negative
–or, adding two negatives gives a positive
–or, subtract a negative from a positive gives a negative
–or, subtract a positive from a negative gives a positive
On your own: Prove you can detect overflow by:
–Carry into MSB xor Carry out of MSB, ex for 4 bit signed numbers
  0111   (7)          1100   (–4)
+ 0011   (3)        + 1011   (–5)
  1010   (–6)         0111   (7)
  carry into MSB = 1, carry out of MSB = 0: overflow
                      carry into MSB = 0, carry out of MSB = 1: overflow
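The carry-in xor carry-out test can be tried out with a small Python sketch. It is not a proof, only a way to experiment; the function name and the default 4-bit width are assumptions for this example.

```python
def add_with_overflow(a, b, n=4):
    """Add two n-bit patterns; flag overflow as carry-into-MSB xor carry-out-of-MSB."""
    mask = (1 << n) - 1
    low_mask = mask >> 1                                  # all bits below the MSB
    carry_into_msb = ((a & low_mask) + (b & low_mask)) >> (n - 1)
    total = (a & mask) + (b & mask)
    carry_out_of_msb = total >> n
    return total & mask, carry_into_msb ^ carry_out_of_msb


print(add_with_overflow(0b0111, 0b0011))   # 7 + 3: does not fit in 4 bits
print(add_with_overflow(0b1100, 0b1011))   # -4 + -5: does not fit in 4 bits
print(add_with_overflow(0b0111, 0b1010))   # 7 + -6: fits
```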

96  2004 Morgan Kaufmann Publishers Need to support the logic operation ( and,nor,or,xor ) –Bit wise operations (no carry operation involved) –Need a logic gate for each function, mux to choose the output Need to support the set-on-less-than instruction ( slt ) –Use subtraction to determine if (a – b) < 0 (implies a < b) –Copy the sign bit into the low order bit of the result, set remaining result bits to 0 Need to support test for equality ( bne, beq ) –Again use subtraction: (a - b) = 0 implies a = b –Additional logic to “nor” all result bits together Immediates are sign extended outside the ALU with wiring (i.e., no logic needed) Tailoring the ALU to the MIPS ISA

97  2004 Morgan Kaufmann Publishers Shift Operations
Also need operations to pack and unpack 8-bit characters into 32-bit words
Shifts move all the bits in a word left or right
sll $t2, $s0, 8 #$t2 = $s0 << 8 bits
srl $t2, $s0, 8 #$t2 = $s0 >> 8 bits
(format: op rs rt rd shamt funct)
Notice that a 5-bit shamt field is enough to shift a 32-bit value 2^5 – 1 or 31 bit positions
Such shifts are logical because they fill with zeros
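To make the pack/unpack use case concrete, here is a Python sketch that models sll and srl on 32-bit words and uses them to pack four 8-bit characters into one word and pull them back out. The helper names pack and unpack, and the choice to put the first character in the most significant byte, are assumptions for illustration.

```python
def sll(x, shamt):
    """Shift left logical on a 32-bit word (bits shifted past bit 31 are lost)."""
    return (x << shamt) & 0xFFFFFFFF


def srl(x, shamt):
    """Shift right logical on a 32-bit word (zeros are shifted in)."""
    return (x & 0xFFFFFFFF) >> shamt


def pack(chars):
    """Pack up to four 8-bit characters into one word, first character highest."""
    word = 0
    for c in chars:
        word = sll(word, 8) | (ord(c) & 0xFF)
    return word


def unpack(word):
    """Recover the four 8-bit characters of a word."""
    return "".join(chr(srl(word, shift) & 0xFF) for shift in (24, 16, 8, 0))


word = pack("MIPS")
print(hex(word), unpack(word))
```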

98  2004 Morgan Kaufmann Publishers Shift Operations, con’t
An arithmetic shift ( sra ) maintains the arithmetic correctness of the shifted value (i.e., a number shifted right one bit should be ½ of its original value; a number shifted left should be 2 times its original value)
–so sra uses the most significant bit (sign bit) as the bit shifted in
–note that there is no need for a sla when using two’s complement number representation
sra $t2, $s0, 8 #$t2 = $s0 >> 8 bits
The shift operation is implemented by hardware separate from the ALU
–using a barrel shifter (which would take lots of gates in discrete logic, but is pretty easy to implement in VLSI)

99  2004 Morgan Kaufmann Publishers Multiply
Binary multiplication is just a bunch of right shifts and adds
[Figure: an n-bit multiplicand and an n-bit multiplier form a partial product array, which is summed into a 2n-bit double precision product]
The partial products can be formed in parallel and added in parallel for faster multiplication

100  2004 Morgan Kaufmann Publishers MIPS Multiply Instruction
Multiply produces a double precision product
mult $s0, $s1 # hi||lo = $s0 * $s1
(format: op rs rt rd shamt funct)
–Low-order word of the product is left in processor register lo and the high-order word is left in register hi
–Instructions mfhi rd and mflo rd are provided to move the product to (user accessible) registers in the register file
Multiplies are done by fast, dedicated hardware and are much more complex (and slower) than adders
Hardware dividers are even more complex and even slower; ditto for hardware square root

101  2004 Morgan Kaufmann Publishers An exception (interrupt) occurs –Control jumps to predefined address for exception –Interrupted address is saved for possible resumption Details based on software system / language –example: flight control vs. homework assignment Don't always want to detect overflow — new MIPS instructions: addu, addiu, subu note: addiu still sign-extends! note: sltu, sltiu for unsigned comparisons Effects of Overflow

102  2004 Morgan Kaufmann Publishers Multiplication
More complicated than addition
–accomplished via shifting and addition
More time and more area
Let's look at 3 versions based on a gradeschool algorithm
    0010 (multiplicand)
  x 1011 (multiplier)
Negative numbers: convert and multiply
–there are better techniques, we won’t look at them

103  2004 Morgan Kaufmann Publishers Multiplication: Implementation Datapath Control

104  2004 Morgan Kaufmann Publishers Final Version What goes here? Multiplier starts in right half of product
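The figure for the final version did not survive, but its hint (the multiplier starts in the right half of the product register) is enough for a rough Python sketch of an unsigned shift-and-add multiplier. This is one plausible reading of that design rather than a transcription of the slide's datapath: each step tests the low bit of the product register, conditionally adds the multiplicand into the left half, and shifts the whole register right one place.

```python
def multiply(multiplicand, multiplier, n=4):
    """Unsigned n-bit by n-bit multiply using a single 2n-bit product register."""
    product = multiplier & ((1 << n) - 1)   # multiplier starts in the right half
    for _ in range(n):
        if product & 1:                      # low bit is the current multiplier bit
            product += multiplicand << n     # add multiplicand into the left half
        product >>= 1                        # shift product (and multiplier) right
    return product


print(f"{multiply(0b0010, 0b1011):08b}")     # the 0010 x 1011 example
```

After n steps every multiplier bit has been shifted out and the register holds the full 2n-bit product.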

105  2004 Morgan Kaufmann Publishers Floating Point (a brief look)
We need a way to represent
–numbers with fractions, e.g., 3.1416
–very small numbers, e.g., .000000001
–very large numbers, e.g., 3.15576 × 10^9
Representation:
–sign, exponent, significand: (–1)^sign × significand × 2^exponent
–more bits for significand gives more accuracy
–more bits for exponent increases range
IEEE 754 floating point standard:
–single precision: 8 bit exponent, 23 bit significand
–double precision: 11 bit exponent, 52 bit significand

106  2004 Morgan Kaufmann Publishers Representing Big (and Small) Numbers
What if we want to encode the approx. age of the earth? 4,600,000,000 or 4.6 x 10^9
or the weight in kg of one a.m.u. (atomic mass unit)? 0.0000000000000000000000000166 or 1.6 x 10^-27
There is no way we can encode either of the above in a 32-bit integer.
Floating point representation: (-1)^sign x F x 2^E
–Still have to fit everything in 32 bits (single precision)
  s (1 bit) | E (exponent, 8 bits) | F (fraction, 23 bits)
–The base (2, not 10) is hardwired in the design of the FPALU
–More bits in the fraction (F) or the exponent (E) is a trade-off between precision (accuracy of the number) and range (size of the number)

107  2004 Morgan Kaufmann Publishers IEEE 754 floating-point standard
Leading “1” bit of significand is implicit
Exponent is “biased” to make sorting easier
–all 0s is smallest exponent, all 1s is largest
–bias of 127 for single precision and 1023 for double precision
–summary: (–1)^sign × (1 + significand) × 2^(exponent – bias)
Example:
–decimal: -.75 = - ( ½ + ¼ )
–binary: -.11 = -1.1 x 2^-1
–floating point: exponent = 126 = 01111110
–IEEE single precision: 10111111010000000000000000000000
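The -.75 example can be checked with Python's standard struct module, which packs a float into the IEEE 754 single precision format. The helper names below are made up for this sketch, and decode_single only handles normalized numbers (stored exponent 1 to 254).

```python
import struct


def float_to_bits(x):
    """Return the 32-bit IEEE 754 single precision pattern of x as an integer."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return word


def decode_single(word):
    """Split a pattern into sign, biased exponent, fraction; rebuild the value."""
    sign = word >> 31
    exponent = (word >> 23) & 0xFF
    fraction = word & 0x7FFFFF
    # (-1)^sign x (1 + significand) x 2^(exponent - bias), normalized numbers only
    value = (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)
    return sign, exponent, fraction, value


bits = float_to_bits(-0.75)
print(f"{bits:032b}")
print(decode_single(bits))
```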

108  2004 Morgan Kaufmann Publishers IEEE 754 FP Standard Encoding
Most (all?) computers these days conform to the IEEE 754 floating point standard
(-1)^sign x (1+F) x 2^(E-bias)
–Formats for both single and double precision
–F is stored in normalized form where the msb in the fraction is 1 (so there is no need to store it!) – called the hidden bit
–To simplify sorting FP numbers, E comes before F in the word and E is represented in excess (biased) notation
Single Precision       Double Precision       Object Represented
E (8)     F (23)       E (11)    F (52)
0         0            0         0            true zero (0)
0         nonzero      0         nonzero      ± denormalized number
1-254     anything     1-2046    anything     ± floating point number
255       0            2047      0            ± infinity
255       nonzero      2047      nonzero      not a number (NaN)

109  2004 Morgan Kaufmann Publishers Floating Point Addition
Addition (and subtraction): (±F1 × 2^E1) + (±F2 × 2^E2) = ±F3 × 2^E3
–Step 0: Restore the hidden bit in F1 and in F2
–Step 1: Align fractions by right shifting F2 by E1 - E2 positions (assuming E1 ≥ E2) keeping track of (three of) the bits shifted out in a round bit, a guard bit, and a sticky bit
–Step 2: Add the resulting F2 to F1 to form F3
–Step 3: Normalize F3 (so it is in the form 1.XXXXX …)
  If F1 and F2 have the same sign, F3 ∈ [1,4), so at most a 1 bit right shift of F3 and an increment of E3 is needed
  If F1 and F2 have different signs, F3 may require many left shifts, each time decrementing E3
–Step 4: Round F3 and possibly normalize F3 again
–Step 5: Rehide the most significant bit of F3 before storing the result

110  2004 Morgan Kaufmann Publishers Floating point addition

111  2004 Morgan Kaufmann Publishers MIPS Floating Point Instructions
MIPS has a separate Floating Point Register File ( $f0, $f1, …, $f31 ) (whose registers are used in pairs for double precision values) with special instructions to load to and store from them
lwc1 $f1,54($s2) #$f1 = Memory[$s2+54]
swc1 $f1,58($s4) #Memory[$s4+58] = $f1
And supports IEEE 754 single
add.s $f2,$f4,$f6 #$f2 = $f4 + $f6
and double precision operations
add.d $f2,$f4,$f6 #$f2||$f3 = $f4||$f5 + $f6||$f7
similarly for sub.s, sub.d, mul.s, mul.d, div.s, div.d

112  2004 Morgan Kaufmann Publishers MIPS Floating Point Instructions, Con’t
And floating point single precision comparison operations
c.x.s $f2,$f4 #if($f2 < $f4) cond=1; else cond=0
where x may be eq, neq, lt, le, gt, ge (the comment shows the lt case)
and branch operations
bc1t 25 #if(cond==1) go to PC+4+25
bc1f 25 #if(cond==0) go to PC+4+25
And double precision comparison operations
c.x.d $f2,$f4 #if($f2||$f3 < $f4||$f5) cond=1; else cond=0

113  2004 Morgan Kaufmann Publishers Floating Point Complexities
Operations are somewhat more complicated (see text)
In addition to overflow we can have “underflow”
Accuracy can be a big problem
–IEEE 754 keeps two extra bits, guard and round
–four rounding modes
–positive divided by zero yields “infinity”
–zero divided by zero yields “not a number”
–other complexities
Implementing the standard can be tricky
Not using the standard can be even worse
–see text for description of 80x86 and Pentium bug!

114  2004 Morgan Kaufmann Publishers Chapter Three Summary Computer arithmetic is constrained by limited precision Bit patterns have no inherent meaning but standards do exist –two’s complement –IEEE 754 floating point Computer instructions determine “meaning” of the bit patterns Performance and accuracy are important so there are many complexities in real machines Algorithm choice is important and may lead to hardware optimizations for both space and time (e.g., multiplication) You may want to look back (Section 3.10 is great reading!)

116  2004 Morgan Kaufmann Publishers Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding underlying organizational motivation Why is some hardware better than others for different programs? What factors of system performance are hardware related? (e.g., Do we need a new machine, or a new operating system?) How does the machine's instruction set affect performance? Performance

117  2004 Morgan Kaufmann Publishers Which of these airplanes has the best performance?
Airplane            Passengers   Range (mi)   Speed (mph)
Boeing 737-100      101          630          598
Boeing 747          470          4150         610
BAC/Sud Concorde    132          4000         1350
Douglas DC-8-50     146          8720         544
How much faster is the Concorde compared to the 747?
How much bigger is the 747 than the Douglas DC-8?

118  2004 Morgan Kaufmann Publishers Computer Performance: TIME, TIME, TIME
Response Time (latency)
–How long does it take for my job to run?
–How long does it take to execute a job?
–How long must I wait for the database query?
Throughput
–How many jobs can the machine run at once?
–What is the average execution rate?
–How much work is getting done?
If we upgrade a machine with a new processor what do we increase?
If we add a new machine to the lab what do we increase?

119  2004 Morgan Kaufmann Publishers Elapsed Time –counts everything (disk and memory accesses, I/O, etc.) –a useful number, but often not good for comparison purposes CPU time –doesn't count I/O or time spent running other programs –can be broken up into system time, and user time Our focus: user CPU time –time spent executing the lines of code that are "in" our program Execution Time

120  2004 Morgan Kaufmann Publishers Book's Definition of Performance
For some program running on machine X,
Performance_X = 1 / Execution time_X
"X is n times faster than Y":
Performance_X / Performance_Y = n
Problem:
–machine A runs a program in 20 seconds
–machine B runs the same program in 25 seconds

121  2004 Morgan Kaufmann Publishers Clock Cycles
Instead of reporting execution time in seconds, we often use cycles
Clock “ticks” indicate when to start activities (one abstraction)
cycle time = time between ticks = seconds per cycle
clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)
A 4 GHz clock has a cycle time of 1 / (4 × 10^9) seconds = 250 ps

122  2004 Morgan Kaufmann Publishers So, to improve performance (everything else being equal) you can either (increase or decrease?) ________ the # of required cycles for a program, or ________ the clock cycle time or, said another way, ________ the clock rate. How to Improve Performance

123  2004 Morgan Kaufmann Publishers How many cycles are required for a program?
Could assume that number of cycles equals number of instructions
[Figure: a time line divided into equal slots for the 1st, 2nd, 3rd, 4th, 5th, 6th, ... instruction]
This assumption is incorrect: different instructions take different amounts of time on different machines.
Why? hint: remember that these are machine instructions, not lines of C code

124  2004 Morgan Kaufmann Publishers Multiplication takes more time than addition Floating point operations take longer than integer ones Accessing memory takes more time than accessing registers Important point: changing the cycle time often changes the number of cycles required for various instructions (more later) time Different numbers of cycles for different instructions

125  2004 Morgan Kaufmann Publishers Example
Our favorite program runs in 10 seconds on computer A, which has a 4 GHz clock. We are trying to help a computer designer build a new machine B that will run this program in 6 seconds. The designer can use new (or perhaps more expensive) technology to substantially increase the clock rate, but has informed us that this increase will affect the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the same program. What clock rate should we tell the designer to target?
Don't panic: we can easily work this out from basic principles
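One way to work the example from basic principles is to go through cycle counts, since cycles = execution time x clock rate. The variable names in this Python sketch are illustrative.

```python
time_a = 10.0            # seconds on machine A
clock_rate_a = 4e9       # machine A runs at 4 GHz
time_b_target = 6.0      # desired seconds on machine B
cycle_factor = 1.2       # B needs 1.2 times as many cycles as A

cycles_a = time_a * clock_rate_a         # cycles = seconds x cycles per second
cycles_b = cycle_factor * cycles_a
clock_rate_b = cycles_b / time_b_target  # cycles per second needed on B

print(f"machine B needs about {clock_rate_b / 1e9:.1f} GHz")
```

So the designer has to double the clock rate to get the program from 10 seconds down to 6, because 20% of the gain is lost to the extra cycles.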

126  2004 Morgan Kaufmann Publishers A given program will require –some number of instructions (machine instructions) –some number of cycles –some number of seconds We have a vocabulary that relates these quantities: –cycle time (seconds per cycle) –clock rate (cycles per second) –CPI (cycles per instruction) a floating point intensive application might have a higher CPI –MIPS (millions of instructions per second) this would be higher for a program using simple instructions Now that we understand cycles

127  2004 Morgan Kaufmann Publishers Performance Performance is determined by execution time Do any of the other variables equal performance? –# of cycles to execute program? –# of instructions in program? –# of cycles per second? –average # of cycles per instruction? –average # of instructions per second? Common pitfall: thinking one of the variables is indicative of performance when it really isn’t.

128  2004 Morgan Kaufmann Publishers Suppose we have two implementations of the same instruction set architecture (ISA). For some program, Machine A has a clock cycle time of 250 ps and a CPI of 2.0 Machine B has a clock cycle time of 500 ps and a CPI of 1.2 What machine is faster for this program, and by how much? If two machines have the same ISA which of our quantities (e.g., clock rate, CPI, execution time, # of instructions, MIPS) will always be identical? CPI Example
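Since both machines implement the same ISA and run the same program, the instruction count cancels, and it is enough to compare the average time per instruction (CPI x clock cycle time). A small Python sketch with made-up names:

```python
def time_per_instruction_ps(cycle_time_ps, cpi):
    """Average time per instruction in picoseconds: CPI x clock cycle time."""
    return cycle_time_ps * cpi


machine_a = time_per_instruction_ps(250, 2.0)
machine_b = time_per_instruction_ps(500, 1.2)
ratio = machine_b / machine_a            # how many times faster A is than B

print(machine_a, machine_b, round(ratio, 3))
```

Machine A has the worse CPI but still wins, because its clock cycle is half as long.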

129  2004 Morgan Kaufmann Publishers A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively). The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C. Which sequence will be faster? How much? What is the CPI for each sequence? # of Instructions Example
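The two code sequences can be compared by counting cycles directly. In this Python sketch the dictionary layout and function name are assumptions; the cycle counts per class come from the problem statement.

```python
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}


def cycles_and_cpi(counts):
    """Return (total cycles, CPI) for a sequence given its per-class instruction counts."""
    cycles = sum(CYCLES_PER_CLASS[cls] * n for cls, n in counts.items())
    instructions = sum(counts.values())
    return cycles, cycles / instructions


sequence_1 = {"A": 2, "B": 1, "C": 2}    # 5 instructions
sequence_2 = {"A": 4, "B": 1, "C": 1}    # 6 instructions

print(cycles_and_cpi(sequence_1))
print(cycles_and_cpi(sequence_2))
```

The second sequence executes more instructions but needs fewer cycles, so instruction count alone does not decide which is faster.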

130  2004 Morgan Kaufmann Publishers Two different compilers are being tested for a 4 GHz. machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles (respectively). Both compilers are used to produce code for a large piece of software. The first compiler's code uses 5 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions. The second compiler's code uses 10 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions. Which sequence will be faster according to MIPS? Which sequence will be faster according to execution time? MIPS example
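A short Python sketch shows how the MIPS rating and the execution time can disagree for the two compilers. The function name and data layout are illustrative; the clock rate, cycle counts, and instruction counts are the ones given in the problem.

```python
CLOCK_RATE = 4e9                          # 4 GHz
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}


def exec_time_and_mips(counts):
    """Return (execution time in seconds, MIPS rating) for per-class instruction counts."""
    cycles = sum(CYCLES_PER_CLASS[cls] * n for cls, n in counts.items())
    instructions = sum(counts.values())
    seconds = cycles / CLOCK_RATE
    mips = instructions / seconds / 1e6   # millions of instructions per second
    return seconds, mips


compiler_1 = {"A": 5_000_000, "B": 1_000_000, "C": 1_000_000}
compiler_2 = {"A": 10_000_000, "B": 1_000_000, "C": 1_000_000}

for name, counts in (("compiler 1", compiler_1), ("compiler 2", compiler_2)):
    seconds, mips = exec_time_and_mips(counts)
    print(f"{name}: {seconds * 1e3:.2f} ms, {mips:.0f} MIPS")
```

The second compiler's code earns the higher MIPS rating yet takes longer to run, which is why MIPS is a poor stand-in for performance.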

131  2004 Morgan Kaufmann Publishers Performance best determined by running a real application –Use programs typical of expected workload –Or, typical of expected class of applications e.g., compilers/editors, scientific applications, graphics, etc. Small benchmarks –nice for architects and designers –easy to standardize –can be abused SPEC (System Performance Evaluation Cooperative) –companies have agreed on a set of real program and inputs –valuable indicator of performance (and compiler technology) –can still be abused Benchmarks

132  2004 Morgan Kaufmann Publishers Benchmark Games
An embarrassed Intel Corp. acknowledged Friday that a bug in a software program known as a compiler had led the company to overstate the speed of its microprocessor chips on an industry benchmark by 10 percent. However, industry analysts said the coding error…was a sad commentary on a common industry practice of “cheating” on standardized performance tests…The error was pointed out to Intel two days ago by a competitor, Motorola …came in a test known as SPECint92…Intel acknowledged that it had “optimized” its compiler to improve its test scores. The company had also said that it did not like the practice but felt compelled to make the optimizations because its competitors were doing the same thing…At the heart of Intel’s problem is the practice of “tuning” compiler programs to recognize certain computing problems in the test and then substituting special handwritten pieces of code…
Saturday, January 6, 1996, New York Times

133  2004 Morgan Kaufmann Publishers SPEC ‘89 Compiler “enhancements” and performance

134  2004 Morgan Kaufmann Publishers SPEC CPU2000

135  2004 Morgan Kaufmann Publishers SPEC 2000 Does doubling the clock rate double the performance? Can a machine with a slower clock rate have better performance?

136  2004 Morgan Kaufmann Publishers Experiment Phone a major computer retailer and tell them you are having trouble deciding between two different computers, specifically you are confused about the processors strengths and weaknesses (e.g., Pentium 4 at 2Ghz vs. Celeron M at 1.4 Ghz ) What kind of response are you likely to get? What kind of response could you give a friend with the same question?

137  2004 Morgan Kaufmann Publishers Amdahl's Law
Execution Time After Improvement = Execution Time Unaffected + ( Execution Time Affected / Amount of Improvement )
Example: "Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?"
How about making it 5 times faster?
Principle: Make the common case fast
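The multiply example can be worked by solving the Amdahl's Law equation for the amount of improvement. In this Python sketch the function names are made up, and required_improvement returns None when no finite improvement can reach the target.

```python
def time_after_improvement(unaffected, affected, improvement):
    """Amdahl's Law: unaffected time plus affected time divided by the improvement."""
    return unaffected + affected / improvement


def required_improvement(total, affected, speedup):
    """Improvement factor needed on the affected part for an overall speedup."""
    target = total / speedup
    unaffected = total - affected
    if target <= unaffected:          # the unaffected part alone already takes too long
        return None
    return affected / (target - unaffected)


print(required_improvement(100, 80, 4))   # run 4 times faster
print(required_improvement(100, 80, 5))   # run 5 times faster
```

The 5 times case has no solution: the 20 seconds that do not involve multiply already equal the whole 20 second budget.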

138  2004 Morgan Kaufmann Publishers Suppose we enhance a machine making all floating-point instructions run five times faster. If the execution time of some benchmark before the floating-point enhancement is 10 seconds, what will the speedup be if half of the 10 seconds is spent executing floating-point instructions? We are looking for a benchmark to show off the new floating-point unit described above, and want the overall benchmark to show a speedup of 3. One benchmark we are considering runs for 100 seconds with the old floating-point hardware. How much of the execution time would floating- point instructions have to account for in this program in order to yield our desired speedup on this benchmark? Example
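Both questions on this slide follow from the same relation, speedup = old time / (unaffected time + affected time / improvement). The Python sketch below answers the first directly and solves for the affected time in the second; the function names are illustrative.

```python
def speedup(total, affected, improvement):
    """Overall speedup when `affected` seconds of `total` run `improvement` times faster."""
    return total / ((total - affected) + affected / improvement)


def affected_time_needed(total, improvement, target_speedup):
    """Seconds of `total` that must be affected to reach `target_speedup`."""
    # total / target = (total - x) + x / improvement, solved for x
    return (total - total / target_speedup) / (1 - 1 / improvement)


print(round(speedup(10, 5, 5), 2))                   # half of 10 s is floating point
print(round(affected_time_needed(100, 5, 3), 2))     # seconds of FP needed out of 100
```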

139  2004 Morgan Kaufmann Publishers Remember
Performance is specific to a particular program or programs
–Total execution time is a consistent summary of performance
For a given architecture performance increases come from:
–increases in clock rate (without adverse CPI effects)
–improvements in processor organization that lower CPI
–compiler enhancements that lower CPI and/or instruction count
–Algorithm/Language choices that affect instruction count
Pitfall: expecting improvement in one aspect of a machine’s performance to affect the total performance

140  2004 Morgan Kaufmann Publishers Performance Metrics Purchasing perspective –given a collection of machines, which has the best performance ? least cost ? best cost/performance? Design perspective –faced with design options, which has the best performance improvement ? least cost ? best cost/performance? Both require –basis for comparison –metric for evaluation Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors

141  2004 Morgan Kaufmann Publishers Defining (Speed) Performance
Normally interested in reducing
–Response time (aka execution time) – the time between the start and the completion of a task
  Important to individual users
–Thus, to maximize performance, need to minimize execution time
–Throughput – the total amount of work done in a given time
  Important to data center managers
–Decreasing response time almost always improves throughput
performance_X = 1 / execution_time_X
If X is n times faster than Y, then
performance_X / performance_Y = execution_time_Y / execution_time_X = n

142  2004 Morgan Kaufmann Publishers Performance Factors
Want to distinguish elapsed time and the time spent on our task
CPU execution time (CPU time) – time the CPU spends working on a task
–Does not include time waiting for I/O or running other programs
CPU execution time for a program = # CPU clock cycles for a program x clock cycle time
or
CPU execution time for a program = # CPU clock cycles for a program / clock rate
Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program

143  2004 Morgan Kaufmann Publishers Review: Machine Clock Rate Clock rate (MHz, GHz) is inverse of clock cycle time (clock period) CC = 1 / CR one clock period 10 nsec clock cycle => 100 MHz clock rate 5 nsec clock cycle => 200 MHz clock rate 2 nsec clock cycle => 500 MHz clock rate 1 nsec clock cycle => 1 GHz clock rate 500 psec clock cycle => 2 GHz clock rate 250 psec clock cycle => 4 GHz clock rate 200 psec clock cycle => 5 GHz clock rate

144  2004 Morgan Kaufmann Publishers Clock Cycles per Instruction
Not all instructions take the same amount of time to execute
–One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction
# CPU clock cycles for a program = # Instructions for a program x Average clock cycles per instruction
Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute
–A way to compare two different implementations of the same ISA
CPI for this instruction class: A = 1, B = 2, C = 3

145  2004 Morgan Kaufmann Publishers Effective CPI
Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging
Overall effective CPI = Σ (i = 1 to n) (CPI_i x IC_i)
–Where IC_i is the count (percentage) of the number of instructions of class i executed
–CPI_i is the (average) number of clock cycles per instruction for that instruction class
–n is the number of instruction classes
The overall effective CPI varies by instruction mix – a measure of the dynamic frequency of instructions across one or many programs

146  2004 Morgan Kaufmann Publishers THE Performance Equation
Our basic performance equation is then
CPU time = Instruction_count x CPI x clock_cycle
or
CPU time = (Instruction_count x CPI) / clock_rate
These equations separate the three key factors that affect performance
–Can measure the CPU execution time by running the program
–The clock rate is usually given
–Can measure overall instruction count by using profilers/simulators without knowing all of the implementation details
–CPI varies by instruction type and ISA implementation for which we must know the implementation details

148  2004 Morgan Kaufmann Publishers Determinants of CPU Performance
CPU time = Instruction_count x CPI x clock_cycle
                         Instruction_count   CPI   clock_cycle
Algorithm                X                   X
Programming language     X                   X
Compiler                 X                   X
ISA                      X                   X     X
Processor organization                       X     X
Technology                                         X

150  2004 Morgan Kaufmann Publishers A Simple Example
Op       Freq   CPI_i   Freq x CPI_i
ALU      50%    1       .5
Load     20%    5       1.0
Store    10%    3       .3
Branch   20%    2       .4
                        Σ = 2.2
How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
–Freq x CPI_i becomes .5, .4, .3, .4, so Σ = 1.6
–CPU time new = 1.6 x IC x CC so 2.2/1.6 means 37.5% faster
How does this compare with using branch prediction to shave a cycle off the branch time?
–Freq x CPI_i becomes .5, 1.0, .3, .2, so Σ = 2.0
–CPU time new = 2.0 x IC x CC so 2.2/2.0 means 10% faster
What if two ALU instructions could be executed at once?
–Freq x CPI_i becomes .25, 1.0, .3, .4, so Σ = 1.95
–CPU time new = 1.95 x IC x CC so 2.2/1.95 means 12.8% faster
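The three what-if questions all reuse the effective CPI sum, so they are convenient to script. This Python sketch uses an assumed layout (a dictionary from operation name to a (frequency, CPI) pair) and made-up helper names.

```python
BASE_MIX = {"ALU": (0.5, 1), "Load": (0.2, 5), "Store": (0.1, 3), "Branch": (0.2, 2)}


def effective_cpi(mix):
    """Overall effective CPI: sum of frequency x CPI over the instruction classes."""
    return sum(freq * cpi for freq, cpi in mix.values())


def with_cpi(mix, op, new_cpi):
    """Copy of the mix with one class given a different CPI."""
    changed = dict(mix)
    changed[op] = (mix[op][0], new_cpi)
    return changed


base = effective_cpi(BASE_MIX)
scenarios = {
    "better data cache (load = 2 cycles)": with_cpi(BASE_MIX, "Load", 2),
    "branch prediction (branch = 1 cycle)": with_cpi(BASE_MIX, "Branch", 1),
    "two ALU instructions at once (ALU = 0.5)": with_cpi(BASE_MIX, "ALU", 0.5),
}
for name, mix in scenarios.items():
    new = effective_cpi(mix)
    print(f"{name}: CPI {new:.2f}, {100 * (base / new - 1):.1f}% faster")
```

Instruction count and clock cycle time are unchanged in all three cases, so the ratio of CPIs is also the ratio of CPU times.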

151  2004 Morgan Kaufmann Publishers Comparing and Summarizing Performance
Guiding principle in reporting performance measurements is reproducibility – list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))
How do we summarize the performance for benchmark set with a single number?
–The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM)
AM = (1/n) Σ (i = 1 to n) Time_i
–Where Time_i is the execution time for the i th program of a total of n programs in the workload
–A smaller mean indicates a smaller average execution time and thus improved performance

152  2004 Morgan Kaufmann Publishers SPEC Benchmarks (www.spec.org)
Integer benchmarks
gzip       compression
vpr        FPGA place & route
gcc        GNU C compiler
mcf        Combinatorial optimization
crafty     Chess program
parser     Word processing program
eon        Computer visualization
perlbmk    perl application
gap        Group theory interpreter
vortex     Object oriented database
bzip2      compression
twolf      Circuit place & route
FP benchmarks
wupwise    Quantum chromodynamics
swim       Shallow water model
mgrid      Multigrid solver in 3D fields
applu      Parabolic/elliptic pde
mesa       3D graphics library
galgel     Computational fluid dynamics
art        Image recognition (NN)
equake     Seismic wave propagation simulation
facerec    Facial image recognition
ammp       Computational chemistry
lucas      Primality testing
fma3d      Crash simulation fem
sixtrack   Nuclear physics accel
apsi       Pollutant distribution

153  2004 Morgan Kaufmann Publishers Example SPEC Ratings

154  2004 Morgan Kaufmann Publishers Other Performance Metrics Power consumption – especially in the embedded market where battery life is important (and passive cooling) –For power-limited applications, the most important metric is energy efficiency

155  2004 Morgan Kaufmann Publishers Summary: Evaluating ISAs Design-time metrics: –Can it be implemented, in how long, at what cost? –Can it be programmed? Ease of compilation? Static Metrics: –How many bytes does the program occupy in memory? Dynamic Metrics: –How many instructions are executed? How many bytes does the processor fetch to execute the program? –How many clocks are required per instruction? –How "lean" a clock is practical? Best Metric: Time to execute the program! This time (Inst. Count x CPI x Cycle Time) depends on the instruction set, the processor organization, and compilation techniques.

156  2004 Morgan Kaufmann Publishers Chapter --Five

157  2004 Morgan Kaufmann Publishers Let's Build a Processor Almost ready to move into chapter 5 and start building a processor First, let’s review Boolean Logic and build the ALU we’ll need (Material from Appendix B) [Figure: ALU block with 32-bit inputs a and b, an operation control input, and a 32-bit result output]

158  2004 Morgan Kaufmann Publishers Problem: Consider a logic function with three inputs: A, B, and C. Output D is true if at least one input is true Output E is true if exactly two inputs are true Output F is true only if all three inputs are true Show the truth table for these three functions. Show the Boolean equations for these three functions. Show an implementation consisting of inverters, AND, and OR gates. Review: Boolean Algebra & Gates
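One way to check an answer to this problem is to enumerate the truth table and compare it against candidate Boolean equations. The sketch below is only an illustration: the function names are hypothetical, and the sum-of-products equation for E is one possible answer, not necessarily the minimal one.

```python
from itertools import product

def outputs_by_definition(a, b, c):
    """D, E, F straight from the problem statement (count the true inputs)."""
    ones = a + b + c
    return int(ones >= 1), int(ones == 2), int(ones == 3)

def outputs_by_equations(a, b, c):
    """Candidate Boolean equations using only NOT, AND, and OR."""
    na, nb, nc = 1 - a, 1 - b, 1 - c            # inverters
    d = a | b | c                               # at least one input true
    e = (a & b & nc) | (a & nb & c) | (na & b & c)  # exactly two true
    f = a & b & c                               # all three true
    return d, e, f

# Rows are (A, B, C, D, E, F) in binary counting order.
truth_table = [(a, b, c) + outputs_by_definition(a, b, c)
               for a, b, c in product((0, 1), repeat=3)]

print("A B C | D E F")
for row in truth_table:
    print("{} {} {} | {} {} {}".format(*row))
```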

159  2004 Morgan Kaufmann Publishers Let's build an ALU to support the andi and ori instructions –we'll just build a 1 bit ALU, and use 32 of them Possible Implementation (sum-of-products): b a operation result opabres An ALU (arithmetic logic unit)

160  2004 Morgan Kaufmann Publishers Review: The Multiplexor Selects one of the inputs to be the output, based on a control input Note: we call this a 2-input mux even though it has 3 inputs! Let's build our ALU using a MUX [Figure: 2-input multiplexor with data inputs A and B, select input S, and output C]

161  2004 Morgan Kaufmann Publishers Different Implementations Not easy to decide the “best” way to build something –Don't want too many inputs to a single gate –Don’t want to have to go through too many gates –for our purposes, ease of comprehension is important Let's look at a 1-bit ALU for addition: c_out = a b + a c_in + b c_in and sum = a xor b xor c_in How could we build a 1-bit ALU for add, and, and or? How could we build a 32-bit ALU?
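The two equations for the 1-bit adder can be tried out directly. The sketch below models one full adder with those equations and chains 32 of them, each carry-out feeding the next carry-in; the function names are hypothetical and the code models logic values only, not gate delays.

```python
def full_adder(a, b, c_in):
    """1-bit adder from the slide: sum = a xor b xor c_in,
    c_out = a*b + a*c_in + b*c_in (all values are 0 or 1)."""
    total = a ^ b ^ c_in
    c_out = (a & b) | (a & c_in) | (b & c_in)
    return total, c_out

def ripple_add(x, y, width=32, c_in=0):
    """Chain `width` full adders; the carry ripples from bit 0 upward."""
    result = 0
    carry = c_in
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry  # carry is the carry-out of the top bit

print(ripple_add(5, 7))           # small example
print(ripple_add(0xFFFFFFFF, 1))  # wraps around with a carry-out
```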

162  2004 Morgan Kaufmann Publishers Building a 32 bit ALU

163  2004 Morgan Kaufmann Publishers Two's complement approach: just negate b and add. How do we negate? A very clever solution: What about subtraction (a – b) ?
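A minimal sketch of the two's complement idea, assuming 32-bit values: negating b is the same as inverting every bit of b and adding 1, so a - b becomes a + (not b) + 1, where the extra 1 can be supplied as the adder's carry-in. The helper names are hypothetical.

```python
MASK = 0xFFFFFFFF  # keep results to 32 bits

def subtract(a, b):
    """a - b computed as a + (~b) + 1, i.e. invert b and set carry-in to 1."""
    inverted_b = ~b & MASK
    return (a + inverted_b + 1) & MASK

def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    return x - (1 << 32) if x & 0x80000000 else x

print(subtract(7, 5))             # 2
print(to_signed(subtract(5, 7)))  # -2
```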

164  2004 Morgan Kaufmann Publishers Adding a NOR function Can also choose to invert a. How do we get “a NOR b” ?

165  2004 Morgan Kaufmann Publishers Tailoring the ALU to the MIPS Need to support the set-on-less-than instruction (slt) –remember: slt is an arithmetic instruction –produces a 1 if rs < rt and 0 otherwise –use subtraction: (a-b) < 0 implies a < b Need to support test for equality (beq $t5, $t6, Label) –use subtraction: (a-b) = 0 implies a = b

Supporting slt Can we figure out the idea? Use this ALU for most significant bit all other bits

167  2004 Morgan Kaufmann Publishers Supporting slt

168  2004 Morgan Kaufmann Publishers Test for equality Notice control lines: 0000 = and 0001 = or 0010 = add 0110 = subtract 0111 = slt 1100 = NOR Note: zero is a 1 when the result is zero!
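The control-line table can be read as a small behavioural model. The sketch below maps each 4-bit control value from the slide to its operation and also produces the zero output. It is not a gate-level description: slt is modelled as a direct signed comparison rather than via the sign bit of a subtraction, and the names are hypothetical.

```python
MASK = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    return x - (1 << 32) if x & 0x80000000 else x

def alu(control, a, b):
    """Return (result, zero) for the 4-bit ALU control values on the slide."""
    a &= MASK
    b &= MASK
    if control == 0b0000:      # and
        result = a & b
    elif control == 0b0001:    # or
        result = a | b
    elif control == 0b0010:    # add
        result = (a + b) & MASK
    elif control == 0b0110:    # subtract: a + (~b) + 1
        result = (a + (~b & MASK) + 1) & MASK
    elif control == 0b0111:    # slt: 1 if a < b (signed), else 0
        result = int(to_signed(a) < to_signed(b))
    elif control == 0b1100:    # NOR
        result = ~(a | b) & MASK
    else:
        raise ValueError("unsupported ALU control value")
    zero = int(result == 0)    # zero is 1 when the result is zero
    return result, zero

print(alu(0b0110, 5, 5))  # subtract equal values: zero output is 1
```

For beq, the datapath would select subtract and branch when zero is 1.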

169  2004 Morgan Kaufmann Publishers Conclusion We can build an ALU to support the MIPS instruction set –key idea: use multiplexor to select the output we want –we can efficiently perform subtraction using two’s complement –we can replicate a 1-bit ALU to produce a 32-bit ALU Important points about hardware –all of the gates are always working –the speed of a gate is affected by the number of inputs to the gate –the speed of a circuit is affected by the number of gates in series (on the “critical path” or the “deepest level of logic”) Our primary focus: comprehension, however, –Clever changes to organization can improve performance (similar to using better algorithms in software) –We saw this in multiplication, let’s look at addition now

170  2004 Morgan Kaufmann Publishers Problem: ripple carry adder is slow Is a 32-bit ALU as fast as a 1-bit ALU? Is there more than one way to do addition? –two extremes: ripple carry and sum-of-products Can you see the ripple? How could you get rid of it?
c_1 = b_0 c_0 + a_0 c_0 + a_0 b_0
c_2 = b_1 c_1 + a_1 c_1 + a_1 b_1    (expanded) c_2 =
c_3 = b_2 c_2 + a_2 c_2 + a_2 b_2    (expanded) c_3 =
c_4 = b_3 c_3 + a_3 c_3 + a_3 b_3    (expanded) c_4 =
Not feasible! Why?

171  2004 Morgan Kaufmann Publishers Carry-lookahead adder An approach in-between our two extremes Motivation: –If we didn't know the value of carry-in, what could we do? –When would we always generate a carry? g_i = a_i b_i –When would we propagate the carry? p_i = a_i + b_i Did we get rid of the ripple?
c_1 = g_0 + p_0 c_0
c_2 = g_1 + p_1 c_1    (expanded) c_2 =
c_3 = g_2 + p_2 c_2    (expanded) c_3 =
c_4 = g_3 + p_3 c_3    (expanded) c_4 =
Feasible! Why?
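As an illustration of the generate/propagate idea, the sketch below computes all four carries of a 4-bit adder directly from g_i, p_i, and c_0. Each carry is written in its expanded form, obtained by substituting the previous carry equation into the next, so no carry depends on another carry. The function name is hypothetical.

```python
def cla4(a, b, c0=0):
    """4-bit carry-lookahead add; returns (4-bit sum, carry-out c4)."""
    abit = [(a >> i) & 1 for i in range(4)]
    bbit = [(b >> i) & 1 for i in range(4)]
    g = [x & y for x, y in zip(abit, bbit)]  # generate:  g_i = a_i b_i
    p = [x | y for x, y in zip(abit, bbit)]  # propagate: p_i = a_i + b_i

    # Expanded carries: each uses only g, p, and c0.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))

    carries = [c0, c1, c2, c3]
    total = 0
    for i in range(4):
        total |= (abit[i] ^ bbit[i] ^ carries[i]) << i
    return total, c4

print(cla4(0b1011, 0b0110))  # 11 + 6 = 17 -> sum bits 0001, carry-out 1
```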

172  2004 Morgan Kaufmann Publishers Can’t build a 16 bit adder this way... (too big) Could use ripple carry of 4-bit CLA adders Better: use the CLA principle again! Use principle to build bigger adders

173  2004 Morgan Kaufmann Publishers ALU Summary We can build an ALU to support MIPS addition Our focus is on comprehension, not performance Real processors use more sophisticated techniques for arithmetic Where performance is not critical, hardware description languages allow designers to completely automate the creation of hardware!

174  2004 Morgan Kaufmann Publishers Chapter Five

175  2004 Morgan Kaufmann Publishers We're ready to look at an implementation of the MIPS Simplified to contain only: –memory-reference instructions: lw, sw –arithmetic-logical instructions: add, sub, and, or, slt –control flow instructions: beq, j Generic Implementation: –use the program counter (PC) to supply instruction address –get the instruction from memory –read registers –use the instruction to decide exactly what to do All instructions use the ALU after reading the registers Why? memory-reference? arithmetic? control flow? The Processor: Datapath & Control

176  2004 Morgan Kaufmann Publishers Abstract / Simplified View: Two types of functional units: –elements that operate on data values (combinational) –elements that contain state (sequential) More Implementation Details

177  2004 Morgan Kaufmann Publishers Unclocked vs. Clocked Clocks used in synchronous logic – when should an element that contains state be updated? State Elements

178  2004 Morgan Kaufmann Publishers The set-reset latch –output depends on present inputs and also on past inputs An unclocked state element

179  2004 Morgan Kaufmann Publishers Latches and Flip-flops Output is equal to the stored value inside the element (don't need to ask for permission to look at the value) Change of state (value) is based on the clock –Latches: state changes whenever the inputs change and the clock is asserted –Flip-flop: state changes only on a clock edge (edge-triggered methodology) Note: "asserted" means "logically true", which could mean electrically low A clocking methodology defines when signals can be read and written; we wouldn't want to read a signal at the same time it was being written

180  2004 Morgan Kaufmann Publishers D-latch Two inputs: –the data value to be stored (D) –the clock signal (C) indicating when to read & store D Two outputs: –the value of the internal state (Q) and its complement

181  2004 Morgan Kaufmann Publishers D flip-flop Output changes only on the clock edge

182  2004 Morgan Kaufmann Publishers Our Implementation An edge triggered methodology Typical execution: –read contents of some state elements, –send values through some combinational logic –write results to one or more state elements

183  2004 Morgan Kaufmann Publishers Built using D flip-flops Register File Do you understand? What is the “Mux” above?

184  2004 Morgan Kaufmann Publishers Abstraction Make sure you understand the abstractions! Sometimes it is easy to think you do, when you don’t

185  2004 Morgan Kaufmann Publishers Register File Note: we still use the real clock to determine when to write

186  2004 Morgan Kaufmann Publishers Simple Implementation Include the functional units we need for each instruction Why do we need this stuff?

187  2004 Morgan Kaufmann Publishers Building the Datapath Use multiplexors to stitch them together

188  2004 Morgan Kaufmann Publishers Control Selecting the operations to perform (ALU, read/write, etc.) Controlling the flow of data (multiplexor inputs) Information comes from the 32 bits of the instruction Example: add $8, $17, $18 Instruction Format:
op      rs     rt     rd     shamt  funct
000000  10001  10010  01000  00000  100000
ALU's operation based on instruction type and function code
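The field layout of the example instruction can be verified by extracting each field with shifts and masks. This sketch assumes the standard MIPS R-type layout (6-bit op, 5-bit rs, rt, rd, shamt, 6-bit funct); the function name is hypothetical.

```python
def decode_r_type(word):
    """Split a 32-bit MIPS R-type instruction word into its fields."""
    return {
        "op":    (word >> 26) & 0x3F,  # bits 31:26
        "rs":    (word >> 21) & 0x1F,  # bits 25:21
        "rt":    (word >> 16) & 0x1F,  # bits 20:16
        "rd":    (word >> 11) & 0x1F,  # bits 15:11
        "shamt": (word >> 6) & 0x1F,   # bits 10:6
        "funct": word & 0x3F,          # bits 5:0
    }

# add $8, $17, $18 written out field by field, as on the slide
word = int("000000" "10001" "10010" "01000" "00000" "100000", 2)
fields = decode_r_type(word)
print(hex(word), fields)
```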

189  2004 Morgan Kaufmann Publishers Control e.g., what should the ALU do with this instruction? Example: lw $1, 100($2)
op = 35, rs = 2, rt = 1, 16-bit offset = 100
ALU control input:
0000  AND
0001  OR
0010  add
0110  subtract
0111  set-on-less-than
1100  NOR
Why is the code for subtract 0110 and not 0011?

190  2004 Morgan Kaufmann Publishers Control Must describe hardware to compute 4-bit ALU control input –given instruction type: 00 = lw, sw; 01 = beq; 10 = arithmetic –function code for arithmetic Describe it using a truth table (can turn into gates) ALUOp computed from instruction type
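The truth table for the ALU control can be sketched as a lookup. The ALUOp encodings and the 4-bit control values come from the slides; the funct codes are the standard MIPS encodings for add, sub, and, or, and slt. The names are hypothetical, and real hardware would use don't-cares rather than a dictionary.

```python
# Standard MIPS funct codes -> 4-bit ALU control input
FUNCT_TO_CONTROL = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # sub
    0b100100: 0b0000,  # and
    0b100101: 0b0001,  # or
    0b101010: 0b0111,  # slt
}

def alu_control(alu_op, funct=0):
    """Compute the ALU control input from the 2-bit ALUOp and the funct field."""
    if alu_op == 0b00:   # lw, sw: add base register and offset
        return 0b0010
    if alu_op == 0b01:   # beq: subtract and test the zero output
        return 0b0110
    if alu_op == 0b10:   # arithmetic: operation chosen by the funct field
        return FUNCT_TO_CONTROL[funct]
    raise ValueError("unsupported ALUOp")

print(format(alu_control(0b10, 0b100010), "04b"))  # sub
```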

192  2004 Morgan Kaufmann Publishers Control Simple combinational logic (truth tables)

193  2004 Morgan Kaufmann Publishers All of the logic is combinational We wait for everything to settle down, and the right thing to be done –ALU might not produce “right answer” right away –we use write signals along with clock to determine when to write Cycle time determined by length of the longest path Our Simple Control Structure We are ignoring some details like setup and hold times

194  2004 Morgan Kaufmann Publishers Single Cycle Implementation Calculate cycle time assuming negligible delays except: –memory (200ps), ALU and adders (100ps), register file access (50ps)
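A rough way to do this calculation is to list the major units each instruction class passes through and add up their delays; the cycle time is then set by the slowest class. The delays come from the slide. The per-class unit lists are an assumption about the simple datapath (instruction fetch, register read, ALU, data memory, register write), and all other delays are taken as negligible.

```python
DELAY_PS = {"memory": 200, "alu": 100, "regfile": 50}

# Assumed sequence of major units used by each instruction class
PATHS = {
    "R-type": ["memory", "regfile", "alu", "regfile"],
    "lw":     ["memory", "regfile", "alu", "memory", "regfile"],
    "sw":     ["memory", "regfile", "alu", "memory"],
    "beq":    ["memory", "regfile", "alu"],
    "j":      ["memory"],
}

def path_delay(units):
    """Total delay in ps along one instruction's path."""
    return sum(DELAY_PS[u] for u in units)

delays = {name: path_delay(units) for name, units in PATHS.items()}
cycle_time_ps = max(delays.values())  # the longest path sets the clock

for name, delay in delays.items():
    print(f"{name}: {delay} ps")
print(f"single-cycle clock period: {cycle_time_ps} ps")
```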

195  2004 Morgan Kaufmann Publishers Where we are headed Single Cycle Problems: –what if we had a more complicated instruction like floating point? –wasteful of area One Solution: –use a “smaller” cycle time –have different instructions take different numbers of cycles –a “multicycle” datapath:

196  2004 Morgan Kaufmann Publishers We will be reusing functional units –ALU used to compute address and to increment PC –Memory used for instruction and data Our control signals will not be determined directly by instruction –e.g., what should the ALU do for a “subtract” instruction? We’ll use a finite state machine for control Multicycle Approach

197  2004 Morgan Kaufmann Publishers Break up the instructions into steps, each step takes a cycle –balance the amount of work to be done –restrict each cycle to use only one major functional unit At the end of a cycle –store values for use in later cycles (easiest thing to do) –introduce additional “internal” registers Multicycle Approach

198  2004 Morgan Kaufmann Publishers Instructions from ISA perspective Consider each instruction from perspective of ISA. Example: –The add instruction changes a register. –Register specified by bits 15:11 of instruction. –Instruction specified by the PC. –New value is the sum (“op”) of two registers. –Registers specified by bits 25:21 and 20:16 of the instruction Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]] –In order to accomplish this we must break up the instruction. (kind of like introducing variables when programming)

199  2004 Morgan Kaufmann Publishers Breaking down an instruction ISA definition of arithmetic: Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]] Could break down to: –IR <= Memory[PC] –A <= Reg[IR[25:21]] –B <= Reg[IR[20:16]] –ALUOut <= A op B –Reg[IR[15:11]] <= ALUOut We forgot an important part of the definition of arithmetic! –PC <= PC + 4

200  2004 Morgan Kaufmann Publishers Idea behind multicycle approach We define each instruction from the ISA perspective (do this!) Break it down into steps following our rule that data flows through at most one major functional unit (e.g., balance work across steps) Introduce new registers as needed (e.g, A, B, ALUOut, MDR, etc.) Finally try and pack as much work into each step (avoid unnecessary cycles) while also trying to share steps where possible (minimizes control, helps to simplify solution) Result: Our book’s multicycle Implementation!

201  2004 Morgan Kaufmann Publishers Instruction Fetch Instruction Decode and Register Fetch Execution, Memory Address Computation, or Branch Completion Memory Access or R-type instruction completion Write-back step INSTRUCTIONS TAKE FROM 3 - 5 CYCLES! Five Execution Steps

202  2004 Morgan Kaufmann Publishers Use PC to get instruction and put it in the Instruction Register. Increment the PC by 4 and put the result back in the PC. Can be described succinctly using RTL "Register-Transfer Language" IR <= Memory[PC]; PC <= PC + 4; Can we figure out the values of the control signals? What is the advantage of updating the PC now? Step 1: Instruction Fetch

203  2004 Morgan Kaufmann Publishers Read registers rs and rt in case we need them Compute the branch address in case the instruction is a branch RTL: A <= Reg[IR[25:21]]; B <= Reg[IR[20:16]]; ALUOut <= PC + (sign-extend(IR[15:0]) << 2); We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic) Step 2: Instruction Decode and Register Fetch

204  2004 Morgan Kaufmann Publishers ALU is performing one of three functions, based on instruction type Memory Reference: ALUOut <= A + sign-extend(IR[15:0]); R-type: ALUOut <= A op B; Branch: if (A==B) PC <= ALUOut; Step 3 (instruction dependent)

205  2004 Morgan Kaufmann Publishers Loads and stores access memory MDR <= Memory[ALUOut]; or Memory[ALUOut] <= B; R-type instructions finish Reg[IR[15:11]] <= ALUOut; The write actually takes place at the end of the cycle on the edge Step 4 (R-type or memory-access)

206  2004 Morgan Kaufmann Publishers Reg[IR[20:16]] <= MDR; Which instruction needs this? Write-back step

207  2004 Morgan Kaufmann Publishers Summary:

208  2004 Morgan Kaufmann Publishers Simple Questions How many cycles will it take to execute this code?
lw $t2, 0($t3)
lw $t3, 4($t3)
beq $t2, $t3, Label   # assume not
add $t5, $t2, $t3
sw $t5, 8($t3)
Label: ...
What is going on during the 8th cycle of execution? In what cycle does the actual addition of $t2 and $t3 take place?
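One way to reason about these questions is to lay the instructions out step by step, one step per cycle. The sketch below assumes the step counts from the five-step summary (loads 5 cycles, stores and R-type 4, branches 3) and assumes the branch falls through so every instruction executes. The step names are informal labels, not terms from the slides.

```python
STEPS = {
    "lw":  ["fetch", "decode", "address computation", "memory read", "write-back"],
    "sw":  ["fetch", "decode", "address computation", "memory write"],
    "add": ["fetch", "decode", "execute", "register write"],
    "beq": ["fetch", "decode", "branch completion"],
}

program = ["lw", "lw", "beq", "add", "sw"]

def schedule(instructions):
    """One (instruction index, opcode, step) entry per clock cycle."""
    timeline = []
    for index, op in enumerate(instructions):
        for step in STEPS[op]:
            timeline.append((index, op, step))
    return timeline

timeline = schedule(program)
total_cycles = len(timeline)
cycle_8 = timeline[7]                                   # cycles count from 1
add_cycle = 1 + timeline.index((3, "add", "execute"))   # when the ALU adds

print(total_cycles, cycle_8, add_cycle)
```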

210  2004 Morgan Kaufmann Publishers Finite state machines: –a set of states and –next state function (determined by current state and the input) –output function (determined by current state and possibly input) –We’ll use a Moore machine (output based only on current state) Review: finite state machines

211  2004 Morgan Kaufmann Publishers Review: finite state machines Example: B. 37 A friend would like you to build an “electronic eye” for use as a fake security device. The device consists of three lights lined up in a row, controlled by the outputs Left, Middle, and Right, which, if asserted, indicate that a light should be on. Only one light is on at a time, and the light “moves” from left to right and then from right to left, thus scaring away thieves who believe that the device is monitoring their activity. Draw the graphical representation for the finite state machine used to specify the electronic eye. Note that the rate of the eye’s movement will be controlled by the clock speed (which should not be too great) and that there are essentially no inputs.
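One possible state machine for the electronic eye, sketched as a Moore machine: four states, no inputs, and outputs that depend only on the current state. The middle light needs two states so the machine remembers which direction it is travelling. The state names are hypothetical.

```python
# Next-state function: no inputs, so each state has exactly one successor.
NEXT_STATE = {
    "left": "middle_going_right",
    "middle_going_right": "right",
    "right": "middle_going_left",
    "middle_going_left": "left",
}

# Output function (Left, Middle, Right), determined by the state alone.
OUTPUT = {
    "left": (1, 0, 0),
    "middle_going_right": (0, 1, 0),
    "right": (0, 0, 1),
    "middle_going_left": (0, 1, 0),
}

def run(start, ticks):
    """Return the light pattern for each of `ticks` clock cycles."""
    state = start
    patterns = []
    for _ in range(ticks):
        patterns.append(OUTPUT[state])
        state = NEXT_STATE[state]
    return patterns

for lights in run("left", 8):
    print(lights)
```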

212  2004 Morgan Kaufmann Publishers Value of control signals is dependent upon: –what instruction is being executed –which step is being performed Use the information we’ve accumulated to specify a finite state machine –specify the finite state machine graphically, or –use microprogramming Implementation can be derived from specification Implementing the Control

213  2004 Morgan Kaufmann Publishers Note: –don’t care if not mentioned –asserted if name only –otherwise exact value How many state bits will we need? Graphical Specification of FSM

214  2004 Morgan Kaufmann Publishers Implementation: Finite State Machine for Control

215  2004 Morgan Kaufmann Publishers PLA Implementation If I picked a horizontal or vertical line could you explain it?

216  2004 Morgan Kaufmann Publishers ROM Implementation ROM = "Read Only Memory" –values of memory locations are fixed ahead of time A ROM can be used to implement a truth table –if the address is m bits, we can address 2^m entries in the ROM –our outputs are the bits of data that the address points to; m is the "height", and n is the "width"
address (m = 3)   data (n = 4)
0 0 0             0 0 1 1
0 0 1             1 1 0 0
0 1 0             1 1 0 0
0 1 1             1 0 0 0
1 0 0             0 0 0 0
1 0 1             0 0 0 1
1 1 0             0 1 1 0
1 1 1             0 1 1 1

217  2004 Morgan Kaufmann Publishers ROM Implementation How many inputs are there? 6 bits for opcode, 4 bits for state = 10 address lines (i.e., 2^10 = 1024 different addresses) How many outputs are there? 16 datapath-control outputs, 4 state bits = 20 outputs ROM is 2^10 x 20 = 20K bits (and a rather unusual size) Rather wasteful, since for lots of the entries, the outputs are the same; i.e., opcode is often ignored

218  2004 Morgan Kaufmann Publishers ROM vs PLA Break up the table into two parts –4 state bits tell you the 16 outputs: 2^4 x 16 bits of ROM –10 bits tell you the 4 next state bits: 2^10 x 4 bits of ROM –Total: 4.3K bits of ROM PLA is much smaller –can share product terms –only need entries that produce an active output –can take into account don't cares Size is (#inputs x #product-terms) + (#outputs x #product-terms) For this example = (10x17)+(20x17) = 510 PLA cells PLA cells usually about the size of a ROM cell (slightly bigger)
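The size comparison can be reproduced with a few lines of arithmetic. The input, output, and product-term counts come from the slides; the function names are hypothetical.

```python
def rom_bits(address_bits, output_bits):
    """A ROM stores one output word for every possible address."""
    return (2 ** address_bits) * output_bits

def pla_cells(inputs, outputs, product_terms):
    """AND-plane cells plus OR-plane cells."""
    return inputs * product_terms + outputs * product_terms

single_rom = rom_bits(10, 20)                  # opcode + state -> all 20 outputs
split_rom = rom_bits(4, 16) + rom_bits(10, 4)  # control outputs + next state
pla = pla_cells(10, 20, 17)

print(single_rom, split_rom, pla)
```

With these numbers the single ROM needs 20480 bits (20K), the split ROM 4352 bits (the slide's 4.3K), and the PLA 510 cells.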

219  2004 Morgan Kaufmann Publishers Complex instructions: the "next state" is often current state + 1 Another Implementation Style

220  2004 Morgan Kaufmann Publishers Details

221  2004 Morgan Kaufmann Publishers Microprogramming What are the “microinstructions” ?

222  2004 Morgan Kaufmann Publishers A specification methodology –appropriate if hundreds of opcodes, modes, cycles, etc. –signals specified symbolically using microinstructions Will two implementations of the same architecture have the same microcode? What would a microassembler do? Microprogramming

223  2004 Morgan Kaufmann Publishers Microinstruction format

224  2004 Morgan Kaufmann Publishers No encoding: –1 bit for each datapath operation –faster, requires more memory (logic) –used for Vax 780 — an astonishing 400K of memory! Lots of encoding: –send the microinstructions through logic to get control signals –uses less memory, slower Historical context of CISC: –Too much logic to put on a single chip with everything else –Use a ROM (or even RAM) to hold the microcode –It’s easy to add new instructions Maximally vs. Minimally Encoded

225  2004 Morgan Kaufmann Publishers Microcode: Trade-offs Distinction between specification and implementation is sometimes blurred Specification Advantages: –Easy to design and write –Design architecture and microcode in parallel Implementation (off-chip ROM) Advantages –Easy to change since values are in memory –Can emulate other architectures –Can make use of internal registers Implementation Disadvantages, SLOWER now that: –Control is implemented on same chip as processor –ROM is no longer faster than RAM –No need to go back and make changes

226  2004 Morgan Kaufmann Publishers Historical Perspective In the ‘60s and ‘70s microprogramming was very important for implementing machines This led to more sophisticated ISAs and the VAX In the ‘80s RISC processors based on pipelining became popular Pipelining the microinstructions is also possible! Implementations of IA-32 architecture processors since 486 use: –“hardwired control” for simpler instructions (few cycles, FSM control implemented using PLA or random logic) –“microcoded control” for more complex instructions (large numbers of cycles, central control store) The IA-64 architecture uses a RISC-style ISA and can be implemented without a large central control store

227  2004 Morgan Kaufmann Publishers Pentium 4 Pipelining is important (last IA-32 without it was 80386 in 1985) Pipelining is used for the simple instructions favored by compilers “Simply put, a high performance implementation needs to ensure that the simple instructions execute quickly, and that the burden of the complexities of the instruction set penalize the complex, less frequently used, instructions” Chapter 6 Chapter 7

228  2004 Morgan Kaufmann Publishers Pentium 4 Somewhere in all that “control” we must handle complex instructions Processor executes simple microinstructions, 70 bits wide (hardwired) 120 control lines for integer datapath (400 for floating point) If an instruction requires more than 4 microinstructions to implement, control from microcode ROM (8000 microinstructions) It's complicated!

229  2004 Morgan Kaufmann Publishers Chapter 5 Summary If we understand the instructions… We can build a simple processor! If instructions take different amounts of time, multi-cycle is better Datapath implemented using: –Combinational logic for arithmetic –State holding elements to remember bits Control implemented using: –Combinational logic for single-cycle implementation –Finite state machine for multi-cycle implementation

230  2004 Morgan Kaufmann Publishers Chapter Six

231  2004 Morgan Kaufmann Publishers Pipelining Improve performance by increasing instruction throughput Ideal speedup is number of stages in the pipeline. Do we achieve this? Note: timing assumptions changed for this example

232  2004 Morgan Kaufmann Publishers Pipelining What makes it easy –all instructions are the same length –just a few instruction formats –memory operands appear only in loads and stores What makes it hard? –structural hazards: suppose we had only one memory –control hazards: need to worry about branch instructions –data hazards: an instruction depends on a previous instruction We’ll build a simple pipeline and look at these issues We’ll talk about modern processors and what really makes it hard: –exception handling –trying to improve performance with out-of-order execution, etc.

233  2004 Morgan Kaufmann Publishers Basic Idea What do we need to add to actually split the datapath into stages?

234  2004 Morgan Kaufmann Publishers Pipelined Datapath Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?

235  2004 Morgan Kaufmann Publishers Corrected Datapath

236  2004 Morgan Kaufmann Publishers Graphically Representing Pipelines Can help with answering questions like: –how many cycles does it take to execute this code? –what is the ALU doing during cycle 4? –use this representation to help understand datapaths

237  2004 Morgan Kaufmann Publishers Pipeline Control

238  2004 Morgan Kaufmann Publishers We have 5 stages. What needs to be controlled in each stage? –Instruction Fetch and PC Increment –Instruction Decode / Register Fetch –Execution –Memory Stage –Write Back How would control be handled in an automobile plant? –a fancy control center telling everyone what to do? –should we use a finite state machine? Pipeline control

239  2004 Morgan Kaufmann Publishers Pass control signals along just like the data Pipeline Control

240  2004 Morgan Kaufmann Publishers Datapath with Control

241  2004 Morgan Kaufmann Publishers Problem with starting next instruction before first is finished –dependencies that “go backward in time” are data hazards Dependencies

242  2004 Morgan Kaufmann Publishers Have compiler guarantee no hazards Where do we insert the “nops” ? sub$2, $1, $3 and $12, $2, $5 or$13, $6, $2 add$14, $2, $2 sw$15, 100($2) Problem: this really slows us down! Software Solution

243  2004 Morgan Kaufmann Publishers Use temporary results, don’t wait for them to be written –register file forwarding to handle read/write to same register –ALU forwarding Forwarding what if this $2 was $13?

244  2004 Morgan Kaufmann Publishers Forwarding The main idea (some details not shown)

245  2004 Morgan Kaufmann Publishers Can't always forward Load word can still cause a hazard: –an instruction tries to read a register following a load instruction that writes to the same register. Thus, we need a hazard detection unit to “stall” the instruction that follows the load

246  2004 Morgan Kaufmann Publishers Stalling We can stall the pipeline by keeping an instruction in the same stage

247  2004 Morgan Kaufmann Publishers Hazard Detection Unit Stall by letting an instruction that won’t write anything go forward

248  2004 Morgan Kaufmann Publishers When we decide to branch, other instructions are in the pipeline! We are predicting “branch not taken” –need to add hardware for flushing instructions if we are wrong Branch Hazards

249  2004 Morgan Kaufmann Publishers Flushing Instructions Note: we’ve also moved branch decision to ID stage

250  2004 Morgan Kaufmann Publishers Branches If the branch is taken, we have a penalty of one cycle For our simple design, this is reasonable With deeper pipelines, penalty increases and static branch prediction drastically hurts performance Solution: dynamic branch prediction A 2-bit prediction scheme
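A 2-bit scheme is often modelled as a saturating counter, and the sketch below uses that variant; the state diagram on the slide may differ in detail. States 0 and 1 predict not taken, states 2 and 3 predict taken, so a single surprise does not flip a strongly held prediction. The class and function names are hypothetical.

```python
class TwoBitPredictor:
    """Saturating 2-bit counter: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move toward 3 on taken, toward 0 on not taken, without wrapping.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

def count_mispredictions(outcomes, start_state=3):
    predictor = TwoBitPredictor(start_state)
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses

# A loop branch taken 9 times and then not taken, with the loop run twice.
loop_twice = ([True] * 9 + [False]) * 2
print(count_mispredictions(loop_twice))  # only the two loop exits are missed
```

Under the same trace, a 1-bit scheme would also mispredict the first iteration of the second run.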

251  2004 Morgan Kaufmann Publishers Branch Prediction Sophisticated Techniques: –A “branch target buffer” to help us look up the destination –Correlating predictors that base prediction on global behavior and recently executed branches (e.g., prediction for a specific branch instruction based on what happened in previous branches) –Tournament predictors that use different types of prediction strategies and keep track of which one is performing best. –A “branch delay slot” which the compiler tries to fill with a useful instruction (make the one cycle delay part of the ISA) Branch prediction is especially important because it enables other more advanced pipelining techniques to be effective! Modern processors predict correctly 95% of the time!

252  2004 Morgan Kaufmann Publishers Improving Performance Try and avoid stalls! E.g., reorder these instructions: lw $t0, 0($t1) lw $t2, 4($t1) sw $t2, 0($t1) sw $t0, 4($t1) Dynamic Pipeline Scheduling –Hardware chooses which instructions to execute next –Will execute instructions out of order (e.g., doesn’t wait for a dependency to be resolved, but rather keeps going!) –Speculates on branches and keeps the pipeline full (may need to rollback if prediction incorrect) Trying to exploit instruction-level parallelism

253  2004 Morgan Kaufmann Publishers Advanced Pipelining Increase the depth of the pipeline Start more than one instruction each cycle (multiple issue) Loop unrolling to expose more ILP (better scheduling) “Superscalar” processors –DEC Alpha 21264: 9 stage pipeline, 6 instruction issue All modern processors are superscalar and issue multiple instructions usually with some limitations (e.g., different “pipes”) VLIW: very long instruction word, static multiple issue (relies more on compiler technology) This class has given you the background you need to learn more!

254  2004 Morgan Kaufmann Publishers Chapter 6 Summary Pipelining does not improve latency, but does improve throughput

255  2004 Morgan Kaufmann Publishers Chapter Seven

256  2004 Morgan Kaufmann Publishers SRAM: –value is stored on a pair of inverting gates –very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: –value is stored as a charge on capacitor (must be refreshed) –very small but slower than SRAM (factor of 5 to 10) Memories: Review

257  2004 Morgan Kaufmann Publishers Exploiting Memory Hierarchy Users want large and fast memories! In 2004: SRAM access times are 0.5 – 5 ns at a cost of $4000 to $10,000 per GB. DRAM access times are 50 – 70 ns at a cost of $100 to $200 per GB. Disk access times are 5 to 20 million ns at a cost of $0.50 to $2 per GB. Try and give it to them anyway –build a memory hierarchy

258  2004 Morgan Kaufmann Publishers Locality A principle that makes having a memory hierarchy a good idea If an item is referenced, temporal locality: it will tend to be referenced again soon spatial locality: nearby items will tend to be referenced soon. Why does code have locality? Our initial focus: two levels (upper, lower) –block: minimum unit of data –hit: data requested is in the upper level –miss: data requested is not in the upper level

259  2004 Morgan Kaufmann Publishers Two issues: –How do we know if a data item is in the cache? –If it is, how do we find it? Our first example: – block size is one word of data – "direct mapped" For each item of data at the lower level, there is exactly one location in the cache where it might be. e.g., lots of items at the lower level share locations in the upper level Cache

260  2004 Morgan Kaufmann Publishers Mapping: address is modulo the number of blocks in the cache Direct Mapped Cache
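The modulo mapping can be illustrated with a tiny simulator that works on block addresses: the cache index is the address modulo the number of blocks, and the remaining upper part of the address is kept as a tag to tell apart blocks that share an index. The reference trace and all names here are made up for the example.

```python
def simulate_direct_mapped(block_addresses, num_blocks):
    """Return (address, index, hit) for each reference to a direct-mapped cache."""
    cache = [None] * num_blocks          # one stored tag per cache block
    results = []
    for addr in block_addresses:
        index = addr % num_blocks        # the only place this block can live
        tag = addr // num_blocks         # identifies which block is resident
        hit = cache[index] == tag
        if not hit:
            cache[index] = tag           # replace whatever was there
        results.append((addr, index, hit))
    return results

trace = [22, 26, 22, 26, 16, 3, 16, 18]  # hypothetical block addresses
results = simulate_direct_mapped(trace, num_blocks=8)
for addr, index, hit in results:
    print(f"block {addr:2d} -> index {index} {'hit' if hit else 'miss'}")
```

Blocks 26 and 18 both map to index 2, so the final reference evicts block 26 even though other cache entries are still empty.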

261  2004 Morgan Kaufmann Publishers For MIPS: What kind of locality are we taking advantage of? Direct Mapped Cache

262  2004 Morgan Kaufmann Publishers Taking advantage of spatial locality: Direct Mapped Cache

263  2004 Morgan Kaufmann Publishers Read hits –this is what we want! Read misses –stall the CPU, fetch block from memory, deliver to cache, restart Write hits: –can replace data in cache and memory (write-through) –write the data only into the cache (write-back the cache later) Write misses: –read the entire block into the cache, then write the word Hits vs. Misses

264  2004 Morgan Kaufmann Publishers Make reading multiple words easier by using banks of memory It can get a lot more complicated... Hardware Issues

265  2004 Morgan Kaufmann Publishers Increasing the block size tends to decrease miss rate: Use split caches because there is more spatial locality in code: Performance

266  2004 Morgan Kaufmann Publishers Performance Simplified model: execution time = (execution cycles + stall cycles) x cycle time, where stall cycles = # of instructions x miss ratio x miss penalty Two ways of improving performance: –decreasing the miss ratio –decreasing the miss penalty What happens if we increase block size?

267  2004 Morgan Kaufmann Publishers Compared to direct mapped, give a series of references that: –results in a lower miss ratio using a 2-way set associative cache –results in a higher miss ratio using a 2-way set associative cache assuming we use the “least recently used” replacement strategy Decreasing miss ratio with associativity

268  2004 Morgan Kaufmann Publishers An implementation

269  2004 Morgan Kaufmann Publishers Performance

270  2004 Morgan Kaufmann Publishers Decreasing miss penalty with multilevel caches Add a second level cache: –often primary cache is on the same chip as the processor –use SRAMs to add another cache above primary memory (DRAM) –miss penalty goes down if data is in 2nd level cache Example: –CPI of 1.0 on a 5 GHz machine with a 5% miss rate, 100 ns DRAM access –Adding 2nd level cache with 5 ns access time decreases miss rate to 0.5% Using multilevel caches: –try and optimize the hit time on the 1st level cache –try and optimize the miss rate on the 2nd level cache
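The example can be worked through numerically under some assumptions the slide does not spell out: the miss rates are taken per instruction, the 0.5% figure is the fraction of instructions that still go all the way to DRAM, and every first-level miss pays the second-level access time. All names are hypothetical.

```python
def latency_cycles(latency_ns, clock_ghz):
    """Convert a latency in ns to clock cycles (GHz = cycles per ns)."""
    return latency_ns * clock_ghz

def cpi_one_level(base_cpi, miss_rate, dram_ns, clock_ghz):
    """Every primary-cache miss pays the full DRAM access time."""
    return base_cpi + miss_rate * latency_cycles(dram_ns, clock_ghz)

def cpi_two_level(base_cpi, l1_miss_rate, l2_ns, dram_miss_rate, dram_ns, clock_ghz):
    """L1 misses pay the L2 access; the few that also miss in L2 pay DRAM too."""
    return (base_cpi
            + l1_miss_rate * latency_cycles(l2_ns, clock_ghz)
            + dram_miss_rate * latency_cycles(dram_ns, clock_ghz))

one_level = cpi_one_level(1.0, 0.05, 100, 5)
two_level = cpi_two_level(1.0, 0.05, 5, 0.005, 100, 5)
print(f"CPI without L2: {one_level:.2f}")
print(f"CPI with L2:    {two_level:.2f}")
print(f"speedup:        {one_level / two_level:.2f}x")
```

Under these assumptions the DRAM penalty is 500 cycles and the second-level penalty is 25 cycles, giving a CPI of 26 without the second-level cache and 4.75 with it.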

271  2004 Morgan Kaufmann Publishers Cache Complexities Not always easy to understand implications of caches: Theoretical behavior of Radix sort vs. Quicksort Observed behavior of Radix sort vs. Quicksort

272  2004 Morgan Kaufmann Publishers Cache Complexities Here is why: Memory system performance is often critical factor –multilevel caches, pipelined processors, make it harder to predict outcomes –Compiler optimizations to increase locality sometimes hurt ILP Difficult to predict best algorithm: need experimental data

273  2004 Morgan Kaufmann Publishers Virtual Memory Main memory can act as a cache for the secondary storage (disk) Advantages: –illusion of having more physical memory –program relocation –protection

274  2004 Morgan Kaufmann Publishers Pages: virtual memory blocks Page faults: the data is not in memory, retrieve it from disk –huge miss penalty, thus pages should be fairly large (e.g., 4KB) –reducing page faults is important (LRU is worth the price) –can handle the faults in software instead of hardware –using write-through is too expensive so we use writeback

275  2004 Morgan Kaufmann Publishers Page Tables

276  2004 Morgan Kaufmann Publishers Page Tables

277  2004 Morgan Kaufmann Publishers Making Address Translation Fast A cache for address translations: translation lookaside buffer (TLB) Typical values: 16–512 entries, miss rate: 0.01%–1%, miss penalty: 10–100 cycles
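
The translation path can be sketched in a few lines. This is a simplified model under stated assumptions: 4 KB pages (the example size from the earlier slide on pages), the page table and TLB represented as plain dictionaries from virtual page number to physical page number, and no TLB capacity limit or replacement policy. The names `translate` and `PageFault` are hypothetical.

```python
PAGE_SIZE = 4096      # 4 KB pages
OFFSET_BITS = 12      # log2(PAGE_SIZE)


class PageFault(Exception):
    """Raised when the page is not in main memory (the OS would fetch it from disk)."""


def translate(vaddr, tlb, page_table):
    """Translate a virtual address, consulting the TLB before the page table."""
    vpn = vaddr >> OFFSET_BITS           # virtual page number
    offset = vaddr & (PAGE_SIZE - 1)     # offset within the page is unchanged
    if vpn in tlb:                       # TLB hit: no page table access needed
        ppn = tlb[vpn]
    elif vpn in page_table:              # TLB miss: read the page table, refill TLB
        ppn = page_table[vpn]
        tlb[vpn] = ppn
    else:                                # not resident: page fault
        raise PageFault(f"virtual page {vpn} is not in memory")
    return (ppn << OFFSET_BITS) | offset


page_table = {0: 5, 1: 9}   # hypothetical mappings
tlb = {}
print(hex(translate(0x1234, tlb, page_table)))  # first access misses in the TLB
print(hex(translate(0x1238, tlb, page_table)))  # same page: now a TLB hit
```

A real TLB has a fixed number of entries and must evict on refill; the dictionary here ignores that to keep the lookup order visible.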

278  2004 Morgan Kaufmann Publishers TLBs and caches

279  2004 Morgan Kaufmann Publishers TLBs and Caches

280  2004 Morgan Kaufmann Publishers Modern Systems

281  2004 Morgan Kaufmann Publishers Modern Systems Things are getting complicated!

282  2004 Morgan Kaufmann Publishers Some Issues Processor speeds continue to increase very fast, much faster than either DRAM or disk access times Design challenge: dealing with this growing disparity –Prefetching? 3rd level caches and more? Memory design?

283  2004 Morgan Kaufmann Publishers Chapters 8 & 9 (partial coverage)

284  2004 Morgan Kaufmann Publishers Interfacing Processors and Peripherals I/O Design affected by many factors (expandability, resilience) Performance: — access latency — throughput — connection between devices and the system — the memory hierarchy — the operating system A variety of different users (e.g., banks, supercomputers, engineers)

285  2004 Morgan Kaufmann Publishers I/O Important but neglected “The difficulties in assessing and designing I/O systems have often relegated I/O to second class status” “courses in every aspect of computing, from programming to computer architecture often ignore I/O or give it scanty coverage” “textbooks leave the subject to near the end, making it easier for students and instructors to skip it!” GUILTY! –we won’t be looking at I/O in much detail –be sure to read Chapter 8 in its entirety –you should probably take a networking class!

286  2004 Morgan Kaufmann Publishers I/O Devices Very diverse devices — behavior (i.e., input vs. output) — partner (who is at the other end?) — data rate

287  2004 Morgan Kaufmann Publishers I/O Example: Disk Drives To access data: –seek: position head over the proper track (3 to 14 ms avg.) –rotational latency: wait for desired sector (on average half a rotation: 0.5 / RPM minutes) –transfer: grab the data (one or more sectors) at 30 to 80 MB/sec
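
The three components add up to the time for one disk access. The sketch below is a rough estimate under assumed values: a 6 ms average seek and a 50 MB/sec transfer rate (both inside the ranges on the slide), plus a 10,000 RPM spindle and a 512-byte sector, which are hypothetical. It also takes 1 MB as 10^6 bytes and ignores controller overhead.

```python
def disk_access_ms(seek_ms, rpm, transfer_mb_per_s, bytes_requested):
    """Average access time = seek + rotational latency + transfer, in ms."""
    rotations_per_second = rpm / 60.0
    # On average the desired sector is half a rotation away.
    rotational_ms = 0.5 / rotations_per_second * 1000.0
    transfer_ms = bytes_requested / (transfer_mb_per_s * 1e6) * 1000.0
    return seek_ms + rotational_ms + transfer_ms


total = disk_access_ms(seek_ms=6.0, rpm=10_000, transfer_mb_per_s=50.0,
                       bytes_requested=512)
print(f"average access time: {total:.2f} ms")
```

With these numbers the rotational latency is 3 ms and the transfer takes about 0.01 ms, so seek and rotation dominate the cost of a small read.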

288  2004 Morgan Kaufmann Publishers I/O Example: Buses Shared communication link (one or more wires) Difficult design: –may be bottleneck –length of the bus –number of devices –tradeoffs (buffers for higher bandwidth increase latency) –support for many different devices –cost Types of buses: –processor-memory (short, high speed, custom design) –backplane (high speed, often standardized, e.g., PCI) –I/O (lengthy, different devices, e.g., USB, Firewire) Synchronous vs. Asynchronous: –synchronous: use a clock and a synchronous protocol; fast and small, but every device must operate at the same rate, and clock skew requires the bus to be short –asynchronous: don’t use a clock and instead use handshaking

289  2004 Morgan Kaufmann Publishers I/O Bus Standards Today we have two dominant bus standards:

290  2004 Morgan Kaufmann Publishers Other important issues Bus Arbitration: — daisy chain arbitration (not very fair) — centralized arbitration (requires an arbiter), e.g., PCI — collision detection, e.g., Ethernet Operating system: — polling — interrupts — direct memory access (DMA) Performance Analysis techniques: — queuing theory — simulation — analysis, i.e., find the weakest link (see “I/O System Design”) Many new developments

291  2004 Morgan Kaufmann Publishers Pentium 4 I/O Options

292  2004 Morgan Kaufmann Publishers Fallacies and Pitfalls Fallacy: the rated mean time to failure of disks is 1,200,000 hours, so disks practically never fail. Fallacy: magnetic disk storage is on its last legs, will be replaced. Fallacy: A 100 MB/sec bus can transfer 100 MB/sec. Pitfall: Moving functions from the CPU to the I/O processor, expecting to improve performance without analysis.
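
The first fallacy becomes clearer with some arithmetic. The sketch below assumes a constant failure rate, that failed disks are replaced, and a hypothetical installation of 1,000 disks; only the 1,200,000-hour MTTF comes from the slide.

```python
HOURS_PER_YEAR = 24 * 365        # 8,760 hours
mttf_hours = 1_200_000           # rated mean time to failure from the slide
disks = 1000                     # hypothetical number of disks in service

mttf_years = mttf_hours / HOURS_PER_YEAR
# Fraction of disks expected to fail in one year of continuous operation.
annual_failure_fraction = HOURS_PER_YEAR / mttf_hours
expected_failures_per_year = disks * annual_failure_fraction

print(f"MTTF: about {mttf_years:.0f} years")
print(f"annual failure fraction: {annual_failure_fraction:.2%}")
print(f"expected failures per year among {disks} disks: "
      f"{expected_failures_per_year:.1f}")
```

An MTTF of roughly 137 years sounds like "never", yet under these assumptions about 0.73% of disks fail each year, so a large installation sees failures regularly.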

293  2004 Morgan Kaufmann Publishers Multiprocessors Idea: create powerful computers by connecting many smaller ones good news: works for timesharing (better than supercomputer) bad news: it’s really hard to write good concurrent programs many commercial failures

294  2004 Morgan Kaufmann Publishers Questions How do parallel processors share data? — single address space (SMP vs. NUMA) — message passing How do parallel processors coordinate? — synchronization (locks, semaphores) — built into send / receive primitives — operating system protocols How are they implemented? — connected by a single bus — connected by a network
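
As a small illustration of coordination through locks in a single address space, the sketch below uses Python threads as a stand-in for processors sharing memory. The function name `run_counter` and its parameters are hypothetical; the point is only that the lock makes each read-modify-write of the shared counter atomic.

```python
import threading


def run_counter(num_threads=4, increments=10_000):
    """Increment one shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:           # only one thread at a time may update counter
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                 # wait for every worker to finish
    return counter


print(run_counter())
```

With the lock held around every update, the result always equals the number of threads times the number of increments. Message passing avoids the shared counter entirely by having each worker send its partial count to one owner.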

295  2004 Morgan Kaufmann Publishers Supercomputers Plot of top 500 supercomputer sites over a decade:

296  2004 Morgan Kaufmann Publishers Using multiple processors: an old idea Some SIMD designs: Costs for the Illiac IV escalated from $8 million in 1966 to $32 million in 1972 despite completion of only ¼ of the machine. It took three more years before it was operational! “For better or worse, computer architects are not easily discouraged” Lots of interesting designs and ideas, lots of failures, few successes

297  2004 Morgan Kaufmann Publishers Topologies

298  2004 Morgan Kaufmann Publishers Clusters Constructed from whole computers Independent, scalable networks Strengths: –Many applications amenable to loosely coupled machines –Exploit local area networks –Cost effective / Easy to expand Weaknesses: –Administration costs not necessarily lower –Connected using I/O bus Highly available due to separation of memories In theory, we should be able to do better

299  2004 Morgan Kaufmann Publishers Google Serves an average of 1000 queries per second Google uses 6,000 processors and 12,000 disks Two sites in Silicon Valley, two in Virginia Each site connected to the Internet using OC48 (2488 Mbit/sec) Reliability: –On an average day, 20 machines need to be rebooted (software error) –2% of the machines are replaced each year In some sense, simple ideas well executed. Better (and cheaper) than other approaches involving increased complexity

300  2004 Morgan Kaufmann Publishers Concluding Remarks Evolution vs. Revolution “More often the expense of innovation comes from being too disruptive to computer users” “Acceptance of hardware ideas requires acceptance by software people; therefore hardware people should learn about software. And if software people want good machines, they must learn more about hardware to be able to communicate with and thereby influence hardware engineers.”
