Chapter 1: Introduction


1 Chapter 1: Introduction
Logistics
Why computer organization is important
Modern trends

2 Why Computer Organization
Yes, I know, required class…

3 Why Computer Organization
Embarrassing if you have a BS in CS/CE and can't make sense of the following terms: DRAM, pipelining, cache hierarchies, I/O, virtual memory
Embarrassing if you have a BS in CS/CE and can't decide which processor to buy: 3 GHz P4 or 2.5 GHz Athlon (this class helps us reason about performance/power)
Obvious first step for chip designers, compiler/OS writers
Will knowledge of the hardware help me write better programs?

4 Must a Programmer Care About Hardware?
Memory management: if we understand how/where data is placed, we can help ensure that relevant data is nearby
Thread management: if we understand how threads interact, we can write smarter multi-threaded programs
Why do we care about multi-threaded programs?

5 Microprocessor Performance
50% improvement every year!! What contributes to this improvement?

6 Modern Trends Historical contributions to performance:
Better processes (faster devices): ~20%
Better circuits/pipelines: ~15%
Better organization/architecture: ~15%
In the future, the second bullet will help little, and the third will not help much for a single core!

              Pentium   P-Pro    P-II     P-III    P-4      Itanium   Montecito
Year          1993      1995     1997     1999     2000     2001      2006
Transistors   3.1M      5.5M     7.5M     9.5M     42M      --        1720M
Clock speed   66MHz     200MHz   300MHz   500MHz   1.5GHz   800MHz    --

Moore's Law in action. At this point, adding transistors to a core yields little benefit.

7 What Does This Mean to a Programmer?
In the past, a new chip directly meant 50% higher performance for a program
Today, one can expect only a 20% improvement, unless… the program can be broken up into multiple threads
Expect #threads to emerge as a major metric for software quality
(figure: 4-way and 8-way multi-core chips)

8 Challenges for the Hardware Designers
Major concerns:
the performance problem (especially scientific workloads)
the power dissipation problem (especially embedded processors)
the temperature problem
the reliability problem

9 The HW/SW Interface
Application software: a[i] = b[i] + c;
Systems software (OS, compiler) translates this into assembly:

  lw $15, 0($2)
  add $16, $15, $14
  add $17, $15, $13
  lw $18, 0($12)
  lw $19, 0($17)
  add $20, $18, $19
  sw $20, 0($16)

The assembler then converts the assembly into machine code that executes on the hardware

10 Computer Components Input/output devices
Secondary storage: non-volatile, slower, cheaper
Primary storage: volatile, faster, costlier
CPU/processor

11 Wafers and Dies

12 Manufacturing Process
Silicon wafers undergo many processing steps so that different parts of the wafer behave as insulators, conductors, and transistors (switches)
Multiple metal layers on the silicon enable connections between transistors
The wafer is chopped into many dies – the size of the die determines yield and cost

13 Processor Technology Trends
Shrinking of transistor sizes: 250nm (1997) → 130nm (2002) → 70nm (2008) → 35nm (2014)
Transistor density increases by 35% per year and die size increases by 10-20% per year… functionality improvements!
Transistor speed improves linearly with size (a complex equation involving voltages, resistances, capacitances)
Wire delays do not scale down at the same rate as transistor delays

14 Memory and I/O Technology Trends
DRAM density increases by 40-60% per year, while latency has reduced by only 33% in 10 years (the memory wall!); bandwidth improves twice as fast as latency decreases
Disk density improves by 100% every year; latency improvement is similar to DRAM
Networks: primary focus on bandwidth; 10Mb → 100Mb in 10 years; 100Mb → 1Gb in 5 years

15 Power Consumption Trends
Dynamic power ∝ activity x capacitance x voltage^2 x frequency
Capacitance per transistor and voltage are decreasing, but the number of transistors and the frequency are increasing at a faster rate
Leakage power is also rising and will soon match dynamic power
Power consumption is already around 100W in some high-performance processors today
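A quick numeric illustration of the dynamic power relation (a sketch in C; the activity factor, switched capacitance, voltage, and frequency below are made-up example values, not figures from the lecture):

#include <stdio.h>

/* dynamic power ~ activity x capacitance x voltage^2 x frequency */
static double dyn_power(double activity, double cap_f, double volts, double freq_hz) {
    return activity * cap_f * volts * volts * freq_hz;
}

int main(void) {
    double base = dyn_power(0.5, 1e-9, 1.2, 3e9);   /* assumed baseline chip */
    double lowv = dyn_power(0.5, 1e-9, 0.6, 3e9);   /* same chip, half the voltage */
    printf("baseline %.2f W, half-voltage %.2f W\n", base, lowv);
    /* the quadratic voltage term means halving voltage alone cuts dynamic power 4x */
    return 0;
}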

16 Next Class Topics: MIPS instruction set architecture (Chapter 2)
Visit the class web-page Sign up for the mailing list Pick up CADE Lab passwords

17 Lecture 2: MIPS Instruction Set
Chapter 2: MIPS instructions

18 Recap Knowledge of hardware improves software quality:
compilers, OS, threaded programs, memory management Important trends: growing transistors, move to multi-core, slowing rate of performance improvement, power/thermal constraints, long memory/disk latencies

19 Instruction Set Understanding the language of the hardware is key to understanding the hardware/software interface A program (in say, C) is compiled into an executable that is composed of machine instructions – this executable must also run on future machines – for example, each Intel processor reads in the same x86 instructions, but each processor handles instructions differently Java programs are converted into portable bytecode that is converted into machine instructions during execution (just-in-time compilation) What are important design principles when defining the instruction set architecture (ISA)?

20 Instruction Set Important design principles when defining the
instruction set architecture (ISA): keep the hardware simple – the chip must only implement basic primitives and run fast keep the instructions regular – simplifies the decoding/scheduling of instructions

21 A Basic MIPS Instruction
C code: a = b + c;
Assembly code (human-friendly machine instructions): add a, b, c  # a is the sum of b and c
Machine code (hardware-friendly machine instructions): the 32-bit binary encoding of the instruction (shown as a figure)
Translate the following C code into assembly code: a = b + c + d + e;

22 Example C code a = b + c + d + e;
translates into the following assembly code:

  add a, b, c          add a, b, c
  add a, a, d    or    add f, d, e
  add a, a, e          add a, a, f

Instructions are simple: fixed number of operands (unlike C)
A single line of C code is converted into multiple lines of assembly code
Some sequences are better than others… the second sequence needs one more (temporary) variable f

23 Subtract Example C code f = (g + h) – (i + j);
Assembly code translation with only add and sub instructions:

24 Subtract Example C code f = (g + h) – (i + j);
translates into the following assembly code:

  add t0, g, h         add f, g, h
  add t1, i, j    or   sub f, f, i
  sub f, t0, t1        sub f, f, j

Each version may produce a different result, because floating-point operations are not necessarily associative and commutative… more on this later

25 Operands In C, each “variable” is a location in memory
In hardware, each memory access is expensive – if variable a is accessed repeatedly, it helps to bring the variable into an on-chip scratchpad and operate on the scratchpad (registers) To simplify the instructions, we require that each instruction (add, sub) only operate on registers Note: the number of operands (variables) in a C program is very large; the number of operands in assembly is fixed… there can be only so many scratchpad registers

26 Registers The MIPS ISA has 32 registers (x86 has 8 registers) –
Why not more? Why not less? Each register is 32-bit wide (modern 64-bit architectures have 64-bit wide registers) A 32-bit entity (4 bytes) is referred to as a word To make the code more readable, registers are partitioned as $s0-$s7 (C/Java variables), $t0-$t9 (temporary variables)…

27 Memory Operands
Values must be fetched from memory before (add and sub) instructions can operate on them
Load word: lw $t0, memory-address
Store word: sw $t0, memory-address
How is memory-address determined?

28 … Memory Address The compiler organizes data in memory… it knows the
location of every variable (saved in a table)… it can fill in the appropriate mem-address for load-store instructions
Example declaration: int a, b, c, d[10]
(figure: a, b, c, and d[0]-d[9] laid out at consecutive word offsets from a base address in memory)
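A small C experiment along these lines (a sketch; the variable names match the slide, but the actual addresses printed are compiler- and OS-specific):

#include <stdio.h>

int a, b, c, d[10];   /* globals: the compiler/linker assigns them fixed addresses */

int main(void) {
    /* with 4-byte ints, these typically land at consecutive word offsets
       from some base address chosen by the toolchain */
    printf("a:    %p\n", (void *)&a);
    printf("b:    %p\n", (void *)&b);
    printf("c:    %p\n", (void *)&c);
    printf("d[0]: %p  d[9]: %p\n", (void *)&d[0], (void *)&d[9]);
    return 0;
}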

29 Immediate Operands An instruction may require a constant as input
An immediate instruction uses a constant number as one of the inputs (instead of a register operand)

  addi $s0, $zero, 1000  # the program has base address 1000 (an assumed
                         # example value) and this is saved in $s0;
                         # $zero is a register that always equals zero
  addi $s1, $s0, 0       # this is the address of variable a
  addi $s2, $s0, 4       # this is the address of variable b
  addi $s3, $s0, 8       # this is the address of variable c
  addi $s4, $s0, 12      # this is the address of variable d[0]

30 Memory Instruction Format
The format of a load instruction:

  lw $t0, 8($t3)

destination register: $t0 (any register)
source address: 8($t3), a constant that is added to the register in brackets

31 Example Convert to assembly: C code: d[3] = d[2] + a;

32 Example Convert to assembly: C code: d[3] = d[2] + a;
Assembly:

  # addi instructions as before
  lw $t0, 8($s4)       # d[2] is brought into $t0
  lw $t1, 0($s1)       # a is brought into $t1
  add $t0, $t0, $t1    # the sum is in $t0
  sw $t0, 12($s4)      # $t0 is stored into d[3]

Assembly version of the code continues to expand!

33 Recap – Numeric Representations
Decimal, binary, and hexadecimal (a compact representation), e.g., 0x23 or 23hex
0-15 (decimal) → 0-9, a-f (hex)

34 Instruction Formats
Instructions are represented as 32-bit numbers (one word), broken into 6 fields

R-type instruction: add $t0, $s1, $s2
  op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
  opcode | source | source | dest | shift amt | function

I-type instruction: lw $t0, 32($s3)
  opcode (6 bits) | rs (5 bits) | rt (5 bits) | constant (16 bits)

35 Logical Operations

Logical ops       C operators   Java operators   MIPS instr
Shift left        <<            <<               sll
Shift right       >>            >>>              srl
Bit-by-bit AND    &             &                and, andi
Bit-by-bit OR     |             |                or, ori
Bit-by-bit NOT    ~             ~                nor
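The same operators in a short C sketch (the values are arbitrary), with the corresponding MIPS instruction noted in the comments:

#include <stdio.h>

int main(void) {
    unsigned int x = 0x0000000F;
    printf("%08x\n", x << 4);     /* shift left:  000000f0 (sll) */
    printf("%08x\n", x >> 2);     /* shift right: 00000003 (srl) */
    printf("%08x\n", x & 0x6);    /* AND:         00000006 (and, andi) */
    printf("%08x\n", x | 0xF0);   /* OR:          000000ff (or, ori) */
    printf("%08x\n", ~x);         /* NOT:         fffffff0 (built from nor) */
    return 0;
}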

36 Control Instructions Conditional branch: Jump to instruction L1 if register1 equals register2: beq register1, register2, L1 Similarly, bne and slt (set-on-less-than) Unconditional branch: j L1 jr $s0 Convert to assembly: if (i == j) f = g+h; else f = g-h;

37 Control Instructions
Conditional branch: jump to instruction L1 if register1 equals register2: beq register1, register2, L1
Similarly, bne and slt (set-on-less-than)
Unconditional branch: j L1, jr $s0
Convert to assembly:

  if (i == j)       bne $s3, $s4, Else
    f = g+h;        add $s0, $s1, $s2
  else              j Exit
    f = g-h;        Else: sub $s0, $s1, $s2
                    Exit:

38 Example Convert to assembly: while (save[i] == k) i += 1;
i and k are in $s3 and $s5 and base of array save[] is in $s6

39 Example
Convert to assembly: while (save[i] == k) i += 1;
i and k are in $s3 and $s5 and the base of array save[] is in $s6

Loop: sll $t1, $s3, 2
      add $t1, $t1, $s6
      lw $t0, 0($t1)
      bne $t0, $s5, Exit
      addi $s3, $s3, 1
      j Loop
Exit:

40 Lecture 3: MIPS Instruction Set
More MIPS instructions Procedure call/return

41 Memory Operands
Values must be fetched from memory before (add and sub) instructions can operate on them
Load word: lw $t0, memory-address
Store word: sw $t0, memory-address
How is memory-address determined?

42 … Memory Address The compiler organizes data in memory… it knows the
location of every variable (saved in a table)… it can fill in the appropriate mem-address for load-store instructions
Example declaration: int a, b, c, d[10]
(figure: a, b, c, and d[0]-d[9] laid out at consecutive word offsets from a base address in memory)

43 Immediate Operands An instruction may require a constant as input
An immediate instruction uses a constant number as one of the inputs (instead of a register operand)

  addi $s0, $zero, 1000  # the program has base address 1000 (an assumed
                         # example value) and this is saved in $s0;
                         # $zero is a register that always equals zero
  addi $s1, $s0, 0       # this is the address of variable a
  addi $s2, $s0, 4       # this is the address of variable b
  addi $s3, $s0, 8       # this is the address of variable c
  addi $s4, $s0, 12      # this is the address of variable d[0]

44 Memory Instruction Format
The format of a load instruction:

  lw $t0, 8($t3)

destination register: $t0 (any register)
source address: 8($t3), a constant that is added to the register in brackets

45 Example Convert to assembly: C code: d[3] = d[2] + a;
Assembly:

  # addi instructions as before
  lw $t0, 8($s4)       # d[2] is brought into $t0
  lw $t1, 0($s1)       # a is brought into $t1
  add $t0, $t0, $t1    # the sum is in $t0
  sw $t0, 12($s4)      # $t0 is stored into d[3]

Assembly version of the code continues to expand!

46 Recap – Numeric Representations
Decimal: 35ten = 3 x 10^1 + 5 x 10^0
Binary: 100011two = 1 x 2^5 + 1 x 2^1 + 1 x 2^0
Hexadecimal (compact representation): 0x23 or 23hex = 2 x 16^1 + 3 x 16^0
0-15 (decimal) → 0-9, a-f (hex)

Dec  Binary  Hex    Dec  Binary  Hex    Dec  Binary  Hex    Dec  Binary  Hex
0    0000    0      4    0100    4      8    1000    8      12   1100    c
1    0001    1      5    0101    5      9    1001    9      13   1101    d
2    0010    2      6    0110    6      10   1010    a      14   1110    e
3    0011    3      7    0111    7      11   1011    b      15   1111    f

47 Instruction Formats
Instructions are represented as 32-bit numbers (one word), broken into 6 fields

R-type instruction: add $t0, $s1, $s2
  op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
  opcode | source | source | dest | shift amt | function

I-type instruction: lw $t0, 32($s3)
  opcode (6 bits) | rs (5 bits) | rt (5 bits) | constant (16 bits)

48 Logical Operations

Logical ops       C operators   Java operators   MIPS instr
Shift left        <<            <<               sll
Shift right       >>            >>>              srl
Bit-by-bit AND    &             &                and, andi
Bit-by-bit OR     |             |                or, ori
Bit-by-bit NOT    ~             ~                nor

49 Control Instructions Conditional branch: Jump to instruction L1 if register1 equals register2: beq register1, register2, L1 Similarly, bne and slt (set-on-less-than) Unconditional branch: j L1 jr $s0 (useful for large case statements and big jumps) Convert to assembly: if (i == j) f = g+h; else f = g-h;

50 Control Instructions
Conditional branch: jump to instruction L1 if register1 equals register2: beq register1, register2, L1
Similarly, bne and slt (set-on-less-than)
Unconditional branch: j L1, jr $s0 (useful for large case statements and big jumps)
Convert to assembly:

  if (i == j)       bne $s3, $s4, Else
    f = g+h;        add $s0, $s1, $s2
  else              j Exit
    f = g-h;        Else: sub $s0, $s1, $s2
                    Exit:

51 Example Convert to assembly: while (save[i] == k) i += 1;
i and k are in $s3 and $s5 and base of array save[] is in $s6

52 Example
Convert to assembly: while (save[i] == k) i += 1;
i and k are in $s3 and $s5 and the base of array save[] is in $s6

Loop: sll $t1, $s3, 2
      add $t1, $t1, $s6
      lw $t0, 0($t1)
      bne $t0, $s5, Exit
      addi $s3, $s3, 1
      j Loop
Exit:

53 Procedures
Each procedure (function, subroutine) maintains a scratchpad of register values – when another procedure is called (the callee), the new procedure takes over the scratchpad – values may have to be saved so we can safely return to the caller
Steps in a call:
parameters (arguments) are placed where the callee can see them
control is transferred to the callee
storage resources are acquired for the callee
the procedure is executed
the result value is placed where the caller can access it
control is returned to the caller

54 Registers The 32 MIPS registers are partitioned as follows:
Register 0: $zero, always stores the constant 0
Regs 2-3: $v0, $v1, return values of a procedure
Regs 4-7: $a0-$a3, input arguments to a procedure
Regs 8-15: $t0-$t7, temporaries
Regs 16-23: $s0-$s7, variables
Regs 24-25: $t8-$t9, more temporaries
Reg 28: $gp, global pointer
Reg 29: $sp, stack pointer
Reg 30: $fp, frame pointer
Reg 31: $ra, return address

55 Jump-and-Link A special register (storage not part of the register file) maintains the address of the instruction currently being executed – this is the program counter (PC) The procedure call is executed by invoking the jump-and-link (jal) instruction – the current PC (actually, PC+4) is saved in the register $ra and we jump to the procedure’s address (the PC is accordingly set to this address) jal NewProcedureAddress Since jal may over-write a relevant value in $ra, it must be saved somewhere (in memory?) before invoking the jal instruction How do we return control back to the caller after completing the callee procedure?

56 … The Stack The register scratchpad for a procedure seems volatile –
it seems to disappear every time we switch procedures – a procedure's values are therefore backed up in memory on a stack
(figure: the stack grows from high addresses to low addresses – Proc A's values, then Proc B's, then Proc C's are pushed as Proc A calls Proc B and Proc B calls Proc C, and popped on each return)

57 Storage Management on a Call/Return
A new procedure must create space for all its variables on the stack
Before executing the jal, the caller must save relevant values ($s0-$s7, $a0-$a3, $ra, temps) into its own stack space
Arguments are copied into $a0-$a3, and the jal is executed
After the callee creates stack space, it updates the value of $sp
Once the callee finishes, it copies the return value into $v0, frees up stack space, and $sp is incremented
On return, the caller may bring its stack values, $ra, and temps back into registers
The responsibility for copies between stack and registers may fall upon either the caller or the callee

58 Example 1 int leaf_example (int g, int h, int i, int j) { int f ;
f = (g + h) – (i + j); return f; }

59 Example 1 int leaf_example (int g, int h, int i, int j) leaf_example:
{
  int f;
  f = (g + h) – (i + j);
  return f;
}

leaf_example:
  addi $sp, $sp, -12
  sw $t1, 8($sp)
  sw $t0, 4($sp)
  sw $s0, 0($sp)
  add $t0, $a0, $a1
  add $t1, $a2, $a3
  sub $s0, $t0, $t1
  add $v0, $s0, $zero
  lw $s0, 0($sp)
  lw $t0, 4($sp)
  lw $t1, 8($sp)
  addi $sp, $sp, 12
  jr $ra

Notes: In this example, the procedure's stack space was used for the caller's variables, not the callee's – the compiler decided that was better. The caller took care of saving its $ra and $a0-$a3.

60 Example 2 int fact (int n) { if (n < 1) return (1);
else return (n * fact(n-1)); }

61 Example 2 int fact (int n) { if (n < 1) return (1);
else return (n * fact(n-1)); }

fact:
  addi $sp, $sp, -8
  sw $ra, 4($sp)
  sw $a0, 0($sp)
  slti $t0, $a0, 1
  beq $t0, $zero, L1
  addi $v0, $zero, 1
  addi $sp, $sp, 8
  jr $ra
L1:
  addi $a0, $a0, -1
  jal fact
  lw $a0, 0($sp)
  lw $ra, 4($sp)
  addi $sp, $sp, 8
  mul $v0, $a0, $v0
  jr $ra

Notes: The caller saves $a0 and $ra in its stack space. Temps are never saved.

62 Memory Organization
The space allocated on the stack by a procedure is termed the activation record (it includes saved values and data local to the procedure) – the frame pointer points to the start of the record and the stack pointer points to the end – variable addresses are specified relative to $fp, as $sp may change during the execution of the procedure
$gp points to the area in memory that saves global variables
Dynamically allocated storage (with malloc()) is placed on the heap
(figure: address space layout – stack at the top, then dynamic data (heap), static data (globals), and text (instructions))

63 Lecture 4: Procedure Calls
Large constants The compilation process

64 Recap The jal instruction is used to jump to the procedure and
save the current PC (+4) into the return address register Arguments are passed in $a0-$a3; return values in $v0-$v1 Since the callee may over-write the caller’s registers, relevant values may have to be copied into memory Each procedure may also require memory space for local variables – a stack is used to organize the memory needs for each procedure

65 … The Stack The register scratchpad for a procedure seems volatile –
it seems to disappear every time we switch procedures – a procedure's values are therefore backed up in memory on a stack
(figure: the stack grows from high addresses to low addresses – Proc A's values, then Proc B's, then Proc C's are pushed as Proc A calls Proc B and Proc B calls Proc C, and popped on each return)

66 Example 1 int leaf_example (int g, int h, int i, int j) { int f ;
f = (g + h) – (i + j); return f; }

67 Example 1 int leaf_example (int g, int h, int i, int j) leaf_example:
{
  int f;
  f = (g + h) – (i + j);
  return f;
}

leaf_example:
  addi $sp, $sp, -12
  sw $t1, 8($sp)
  sw $t0, 4($sp)
  sw $s0, 0($sp)
  add $t0, $a0, $a1
  add $t1, $a2, $a3
  sub $s0, $t0, $t1
  add $v0, $s0, $zero
  lw $s0, 0($sp)
  lw $t0, 4($sp)
  lw $t1, 8($sp)
  addi $sp, $sp, 12
  jr $ra

Notes: In this example, the procedure's stack space was used for the caller's variables, not the callee's – the compiler decided that was better. The caller took care of saving its $ra and $a0-$a3.

68 Example 2 int fact (int n) { if (n < 1) return (1);
else return (n * fact(n-1)); }

69 Example 2 int fact (int n) { if (n < 1) return (1);
else return (n * fact(n-1)); }

fact:
  addi $sp, $sp, -8
  sw $ra, 4($sp)
  sw $a0, 0($sp)
  slti $t0, $a0, 1
  beq $t0, $zero, L1
  addi $v0, $zero, 1
  addi $sp, $sp, 8
  jr $ra
L1:
  addi $a0, $a0, -1
  jal fact
  lw $a0, 0($sp)
  lw $ra, 4($sp)
  addi $sp, $sp, 8
  mul $v0, $a0, $v0
  jr $ra

Notes: The caller saves $a0 and $ra in its stack space. Temps are never saved.

70 Memory Organization
The space allocated on the stack by a procedure is termed the activation record (it includes saved values and data local to the procedure) – the frame pointer points to the start of the record and the stack pointer points to the end – variable addresses are specified relative to $fp, as $sp may change during the execution of the procedure
$gp points to the area in memory that saves global variables
Dynamically allocated storage (with malloc()) is placed on the heap
(figure: address space layout – stack at the top, then dynamic data (heap), static data (globals), and text (instructions))

71 Dealing with Characters
Instructions are also provided to deal with byte-sized and half-word quantities: lb (load-byte), sb, lh, sh These data types are most useful when dealing with characters, pixel values, etc. C employs ASCII formats to represent characters – each character is represented with 8 bits and a string ends in the null character (corresponding to the 8-bit number 0)

72 Example Convert to assembly: void strcpy (char x[], char y[]) { int i;
while ((x[i] = y[i]) != `\0’) i += 1; }

73 Example Convert to assembly: strcpy: void strcpy (char x[], char y[])
{
  int i;
  i = 0;
  while ((x[i] = y[i]) != '\0')
    i += 1;
}

strcpy:
  addi $sp, $sp, -4
  sw $s0, 0($sp)
  add $s0, $zero, $zero
L1:
  add $t1, $s0, $a1
  lb $t2, 0($t1)
  add $t3, $s0, $a0
  sb $t2, 0($t3)
  beq $t2, $zero, L2
  addi $s0, $s0, 1
  j L1
L2:
  lw $s0, 0($sp)
  addi $sp, $sp, 4
  jr $ra

74 Large Constants Immediate instructions can only specify 16-bit constants The lui instruction is used to store a 16-bit constant into the upper 16 bits of a register… thus, two immediate instructions are used to specify a 32-bit constant The destination PC-address in a conditional branch is specified as a 16-bit constant, relative to the current PC A jump (j) instruction can specify a 26-bit constant; if more bits are required, the jump-register (jr) instruction is used
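To see the split that lui/ori perform, here is a small C sketch (the 32-bit constant is arbitrary) that decomposes a value into the 16-bit halves the two instructions would load:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t value = 0xDEADBEEF;          /* arbitrary 32-bit constant */
    uint16_t upper = value >> 16;         /* what lui would place in bits 31..16 */
    uint16_t lower = value & 0xFFFF;      /* what ori would fill into bits 15..0 */
    uint32_t rebuilt = ((uint32_t)upper << 16) | lower;
    printf("upper=0x%04x lower=0x%04x rebuilt=0x%08x\n", upper, lower, rebuilt);
    return 0;
}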

75 Starting a Program
C program (x.c) → Compiler → assembly language program (x.s) → Assembler → object file, a machine language module (x.o)
Object files (x.o) and library routines (x.a, x.so) → Linker → executable machine language program (a.out) → Loader → Memory

76 Role of Assembler Convert pseudo-instructions into actual hardware
instructions – pseudo-instrs make it easier to program in assembly – examples: “move”, “blt”, 32-bit immediate operands, etc. Convert assembly instrs into machine instrs – a separate object file (x.o) is created for each C file (x.c) – compute the actual values for instruction labels – maintain info on external references and debugging information

77 Role of Linker Stitches different object files into a single executable patch internal and external references determine addresses of data and instruction labels organize code and data modules in memory Some libraries (DLLs) are dynamically linked – the executable points to dummy routines – these dummy routines call the dynamic linker-loader so they can update the executable to jump to the correct routine

78 Full Example – Sort in C void sort (int v[], int n) { int i, j;
  for (i=0; i<n; i+=1) {
    for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) {
      swap(v, j);
    }
  }
}

void swap (int v[], int k)
{
  int temp;
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp;
}

Allocate registers to program variables
Produce code for the program body
Preserve registers across procedure invocations

79 The swap Procedure
Register allocation: $a0 and $a1 for the two arguments, $t0 for the temp variable – no need for saves and restores, as we're not using $s0-$s7 and this is a leaf procedure (it won't need to re-use $a0 and $a1)

swap: sll $t1, $a1, 2
      add $t1, $a0, $t1
      lw $t0, 0($t1)
      lw $t2, 4($t1)
      sw $t2, 0($t1)
      sw $t0, 4($t1)
      jr $ra

80 The sort Procedure
Register allocation: arguments v and n use $a0 and $a1, i and j use $s0 and $s1; we must save $a0 and $a1 before calling the leaf procedure
The outer for loop looks like this (note the use of pseudo-instrs):

  move $s0, $zero            # initialize the loop
loopbody1:
  bge $s0, $a1, exit1        # will eventually use slt and beq
  … body of inner loop …
  addi $s0, $s0, 1
  j loopbody1
exit1:

for (i=0; i<n; i+=1) { for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) { swap (v,j); } }

81 The sort Procedure
The inner for loop looks like this:

  addi $s1, $s0, -1          # initialize the loop: j = i - 1
loopbody2:
  blt $s1, $zero, exit2      # will eventually use slt and beq
  sll $t1, $s1, 2
  add $t2, $a0, $t1
  lw $t3, 0($t2)             # $t3 = v[j]
  lw $t4, 4($t2)             # $t4 = v[j+1]
  bge $t4, $t3, exit2        # exit when v[j] <= v[j+1]
  … body of inner loop …
  addi $s1, $s1, -1
  j loopbody2
exit2:

for (i=0; i<n; i+=1) { for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) { swap (v,j); } }

82 Saves and Restores Since we repeatedly call “swap” with $a0 and $a1, we begin “sort” by copying its arguments into $s2 and $s3 – must update the rest of the code in “sort” to use $s2 and $s3 instead of $a0 and $a1 Must save $ra at the start of “sort” because it will get over-written when we call “swap” Must also save $s0-$s3 so we don’t overwrite something that belongs to the procedure that called “sort”

83 Saves and Restores

sort: addi $sp, $sp, -20
      sw $ra, 16($sp)
      sw $s3, 12($sp)
      sw $s2, 8($sp)
      sw $s1, 4($sp)
      sw $s0, 0($sp)
      move $s2, $a0
      move $s3, $a1
      …
      move $a0, $s2          # the inner loop body starts here
      move $a1, $s1
      jal swap
      …
exit1: lw $s0, 0($sp)
      lw $s1, 4($sp)
      lw $s2, 8($sp)
      lw $s3, 12($sp)
      lw $ra, 16($sp)
      addi $sp, $sp, 20
      jr $ra

9 lines of C code → 35 lines of assembly

84 Relative Performance
(table: relative performance, clock cycle count, instruction count, and CPI for gcc optimization levels none, O1, O2, and O3)
A Java interpreter has a relative performance of 0.12, while the Java just-in-time compiler has a relative performance of 2.13
Note that the quicksort algorithm is about three orders of magnitude faster than the bubble sort algorithm (for 100K elements)

85 Lecture 5: MIPS Examples
Today’s topics: the compilation process full example – sort in C Reminder: 2nd assignment will be posted later today

86 Dealing with Characters
Instructions are also provided to deal with byte-sized and half-word quantities: lb (load-byte), sb, lh, sh These data types are most useful when dealing with characters, pixel values, etc. C employs ASCII formats to represent characters – each character is represented with 8 bits and a string ends in the null character (corresponding to the 8-bit number 0)

87 Example Convert to assembly: void strcpy (char x[], char y[]) { int i;
while ((x[i] = y[i]) != `\0’) i += 1; }

88 Example Convert to assembly: strcpy: void strcpy (char x[], char y[])
{
  int i;
  i = 0;
  while ((x[i] = y[i]) != '\0')
    i += 1;
}

strcpy:
  addi $sp, $sp, -4
  sw $s0, 0($sp)
  add $s0, $zero, $zero
L1:
  add $t1, $s0, $a1
  lb $t2, 0($t1)
  add $t3, $s0, $a0
  sb $t2, 0($t3)
  beq $t2, $zero, L2
  addi $s0, $s0, 1
  j L1
L2:
  lw $s0, 0($sp)
  addi $sp, $sp, 4
  jr $ra

89 Large Constants Immediate instructions can only specify 16-bit constants The lui instruction is used to store a 16-bit constant into the upper 16 bits of a register… thus, two immediate instructions are used to specify a 32-bit constant The destination PC-address in a conditional branch is specified as a 16-bit constant, relative to the current PC A jump (j) instruction can specify a 26-bit constant; if more bits are required, the jump-register (jr) instruction is used

90 Starting a Program
C program (x.c) → Compiler → assembly language program (x.s) → Assembler → object file, a machine language module (x.o)
Object files (x.o) and library routines (x.a, x.so) → Linker → executable machine language program (a.out) → Loader → Memory

91 Role of Assembler Convert pseudo-instructions into actual hardware
instructions – pseudo-instrs make it easier to program in assembly – examples: “move”, “blt”, 32-bit immediate operands, etc. Convert assembly instrs into machine instrs – a separate object file (x.o) is created for each C file (x.c) – compute the actual values for instruction labels – maintain info on external references and debugging information

92 Role of Linker Stitches different object files into a single executable patch internal and external references determine addresses of data and instruction labels organize code and data modules in memory Some libraries (DLLs) are dynamically linked – the executable points to dummy routines – these dummy routines call the dynamic linker-loader so they can update the executable to jump to the correct routine

93 Full Example – Sort in C void sort (int v[], int n) { int i, j;
  for (i=0; i<n; i+=1) {
    for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) {
      swap(v, j);
    }
  }
}

void swap (int v[], int k)
{
  int temp;
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp;
}

Allocate registers to program variables
Produce code for the program body
Preserve registers across procedure invocations

94 The swap Procedure void swap (int v[], int k) { int temp; temp = v[k];
v[k] = v[k+1]; v[k+1] = temp; } Allocate registers to program variables Produce code for the program body Preserve registers across procedure invocations

95 The swap Procedure
Register allocation: $a0 and $a1 for the two arguments, $t0 for the temp variable – no need for saves and restores, as we're not using $s0-$s7 and this is a leaf procedure (it won't need to re-use $a0 and $a1)

swap: sll $t1, $a1, 2
      add $t1, $a0, $t1
      lw $t0, 0($t1)
      lw $t2, 4($t1)
      sw $t2, 0($t1)
      sw $t0, 4($t1)
      jr $ra

void swap (int v[], int k) { int temp; temp = v[k]; v[k] = v[k+1]; v[k+1] = temp; }

96 The sort Procedure Register allocation: arguments v and n use $a0 and $a1, i and j use $s0 and $s1 for (i=0; i<n; i+=1) { for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) { swap (v,j); }

97 The sort Procedure
Register allocation: arguments v and n use $a0 and $a1, i and j use $s0 and $s1; we must save $a0, $a1, and $ra before calling the leaf procedure
The outer for loop looks like this (note the use of pseudo-instrs):

  move $s0, $zero            # initialize the loop
loopbody1:
  bge $s0, $a1, exit1        # will eventually use slt and beq
  … body of inner loop …
  addi $s0, $s0, 1
  j loopbody1
exit1:

for (i=0; i<n; i+=1) { for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) { swap (v,j); } }

98 The sort Procedure
The inner for loop looks like this:

  addi $s1, $s0, -1          # initialize the loop: j = i - 1
loopbody2:
  blt $s1, $zero, exit2      # will eventually use slt and beq
  sll $t1, $s1, 2
  add $t2, $a0, $t1
  lw $t3, 0($t2)             # $t3 = v[j]
  lw $t4, 4($t2)             # $t4 = v[j+1]
  bge $t4, $t3, exit2        # exit when v[j] <= v[j+1]
  … body of inner loop …
  addi $s1, $s1, -1
  j loopbody2
exit2:

for (i=0; i<n; i+=1) { for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) { swap (v,j); } }

99 Saves and Restores Since we repeatedly call “swap” with $a0 and $a1, we begin “sort” by copying its arguments into $s2 and $s3 – must update the rest of the code in “sort” to use $s2 and $s3 instead of $a0 and $a1 Must save $ra at the start of “sort” because it will get over-written when we call “swap” Must also save $s0-$s3 so we don’t overwrite something that belongs to the procedure that called “sort”

100 Saves and Restores

sort: addi $sp, $sp, -20
      sw $ra, 16($sp)
      sw $s3, 12($sp)
      sw $s2, 8($sp)
      sw $s1, 4($sp)
      sw $s0, 0($sp)
      move $s2, $a0
      move $s3, $a1
      …
      move $a0, $s2          # the inner loop body starts here
      move $a1, $s1
      jal swap
      …
exit1: lw $s0, 0($sp)
      lw $s1, 4($sp)
      lw $s2, 8($sp)
      lw $s3, 12($sp)
      lw $ra, 16($sp)
      addi $sp, $sp, 20
      jr $ra

9 lines of C code → 35 lines of assembly

for (i=0; i<n; i+=1) { for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) { swap (v,j); } }

101 Relative Performance
(table: relative performance, clock cycle count, instruction count, and CPI for gcc optimization levels none, O1, O2, and O3)
A Java interpreter has a relative performance of 0.12, while the Java just-in-time compiler has a relative performance of 2.13
Note that the quicksort algorithm is about three orders of magnitude faster than the bubble sort algorithm (for 100K elements)

102 IA-32 Instruction Set Intel’s IA-32 instruction set has evolved over 20 years – old features are preserved for software compatibility Numerous complex instructions – complicates hardware design (Complex Instruction Set Computer – CISC) Instructions have different sizes, operands can be in registers or memory, only 8 general-purpose registers, one of the operands is over-written RISC instructions are more amenable to high performance (clock speed and parallelism) – modern Intel processors convert IA-32 instructions into simpler micro-operations

103 Lecture 6: Compilers, the SPIM Simulator
Today’s topics: SPIM simulator The compilation process Additional TA hours: Liqun Cheng, legion at cs, Office: MEB 2162 Office hours: Mon/Wed 11-12 TA hours for Josh: Wed 11:45-12:45 (EMCB 130) TA hours for Devyani: Wed 11:45-12:45 (MEB 3431)

104 IA-32 Instruction Set Intel’s IA-32 instruction set has evolved over 20 years – old features are preserved for software compatibility Numerous complex instructions – complicates hardware design (Complex Instruction Set Computer – CISC) Instructions have different sizes, operands can be in registers or memory, only 8 general-purpose registers, one of the operands is over-written RISC instructions are more amenable to high performance (clock speed and parallelism) – modern Intel processors convert IA-32 instructions into simpler micro-operations

105 SPIM SPIM is a simulator that reads in an assembly program
and models its behavior on a MIPS processor Note that a “MIPS add instruction” will eventually be converted to an add instruction for the host computer’s architecture – this translation happens under the hood To simplify the programmer’s task, it accepts pseudo-instructions, large constants, constants in decimal/hex formats, labels, etc. The simulator allows us to inspect register/memory values to confirm that our program is behaving correctly

106 Example This simple program (similar to what we’ve written in class) will run on SPIM (a “main” label is introduced so SPIM knows where to start) main: addi $t0, $zero, 5 addi $t1, $zero, 7 add $t2, $t0, $t1 If we inspect the contents of $t2, we’ll find the number 12

107 User Interface

rajeev@trust > spim
(spim) read "add.s"
(spim) run
(spim) print $10
Reg 10 = 0x0000000c (12)
(spim) reinitialize
(spim) step
(spim) print $8
Reg 8 = 0x00000005 (5)
(spim) print $9
Reg 9 = 0x00000000 (0)
(spim) step
(spim) print $9
Reg 9 = 0x00000007 (7)
(spim) exit

File add.s:
main: addi $t0, $zero, 5
      addi $t1, $zero, 7
      add $t2, $t0, $t1

108 Directives
File add.s:

.text
.globl main            # this function is visible to other files
main: addi $t0, $zero, 5
      addi $t1, $zero, 7
      add $t2, $t0, $t1
      jal swap_proc
      jr $ra
.globl swap_proc
swap_proc: …

(figure: the .text section is loaded into the text (instructions) region of the address space, below the static data, heap, and stack)

109 Directives
File add.s:

.data
.word 5
.word 7
.byte 25
.asciiz "the answer is"
.text
.globl main
main: lw $t0, 0($gp)
      lw $t1, 4($gp)
      add $t2, $t0, $t1
      jal swap_proc
      jr $ra

(figure: the .data section goes to the static data (globals) region; the .text section goes to the text (instructions) region)

110 Labels
File add.s:

.data
in1: .word 5
in2: .word 7
c1:  .byte 25
str: .asciiz "the answer is"
.text
.globl main
main: lw $t0, in1
      lw $t1, in2
      add $t2, $t0, $t1
      jal swap_proc
      jr $ra

111 Endian-ness
Two major formats exist for transferring values between registers and memory
Memory: a word is stored as four bytes, from a low address to a high address
Little-endian register: the first byte read (from the lowest address) goes in the low end of the register (the least-significant byte)
Big-endian register: the first byte read goes in the big end of the register (the most-significant byte)
(figure: the same four memory bytes loaded into a register in little-endian and big-endian order, with the most-significant and least-significant bits labeled)
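A little C probe (a sketch) that stores a known 32-bit pattern and inspects the byte at the lowest address; on a little-endian host the low-order byte shows up first:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t word = 0x0a0b0c0d;          /* arbitrary test pattern */
    uint8_t *bytes = (uint8_t *)&word;   /* view the same word as 4 bytes */
    printf("byte at lowest address: 0x%02x\n", bytes[0]);
    /* 0x0d means little-endian (e.g., x86); 0x0a means big-endian */
    return 0;
}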

112 System Calls SPIM provides some OS services: most useful are
operations for I/O: read, write, file open, file close The arguments for the syscall are placed in $a0-$a3 The type of syscall is identified by placing the appropriate number in $v0 – 1 for print_int, 4 for print_string, 5 for read_int, etc. $v0 is also used for the syscall’s return value

113 Example Print Routine

.data
str: .asciiz "the answer is "
.text
li $v0, 4         # load immediate; 4 is the code for print_string
la $a0, str       # the print_string syscall expects the string
                  # address as the argument; la is the instruction
                  # to load the address of the operand (str)
syscall           # SPIM will now invoke syscall-4
li $v0, 1         # syscall-1 corresponds to print_int
li $a0, 5         # print_int expects the integer as its argument
                  # (5 is an assumed example value)
syscall           # SPIM will now invoke syscall-1

114 Example Write an assembly program to prompt the user for two numbers and print the sum of the two numbers

115 Example

.data
str1: .asciiz "Enter 2 numbers:"
str2: .asciiz "The sum is "
.text
.globl main
main: li $v0, 4            # print_string
      la $a0, str1
      syscall
      li $v0, 5            # read_int; the result is returned in $v0
      syscall
      add $t0, $v0, $zero
      li $v0, 5
      syscall
      add $t1, $v0, $zero
      li $v0, 4
      la $a0, str2
      syscall
      li $v0, 1            # print_int
      add $a0, $t1, $t0
      syscall

116 Compilation Steps The front-end: deals mostly with language specific actions Scanning: reads characters and breaks them into tokens Parsing: checks syntax Semantic analysis: makes sure operations/types are meaningful Intermediate representation: simple instructions, infinite registers, makes few assumptions about hw The back-end: optimizations and code generation Local optimizations: within a basic block Global optimizations: across basic blocks Register allocation

117 Dataflow Control flow graph: each box represents a basic block and
arcs represent potential jumps between instructions For each block, the compiler computes values that were defined (written to) and used (read from) Such dataflow analysis is key to several optimizations: for example, moving code around, eliminating dead code, removing redundant computations, etc.

118 Register Allocation The IR contains infinite virtual registers – these must be mapped to the architecture’s finite set of registers (say, 32 registers) For each virtual register, its live range is computed (the range between which the register is defined and used) We must now assign one of 32 colors to each virtual register so that intersecting live ranges are colored differently – can be mapped to the famous graph coloring problem If this is not possible, some values will have to be temporarily spilled to memory and restored (this is equivalent to breaking a single live range into smaller live ranges)
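A compact C sketch of the coloring idea (the interference graph below is made up, and real allocators add spill heuristics and live-range splitting): greedily color an interference graph given as an adjacency matrix, spilling when no color is free:

#include <stdio.h>

#define N 4   /* virtual registers VR0..VR3 */
#define K 2   /* available physical registers ("colors") */

int main(void) {
    /* interfere[i][j] = 1 if the two live ranges overlap */
    int interfere[N][N] = {
        {0, 1, 0, 0},
        {1, 0, 1, 0},
        {0, 1, 0, 1},
        {0, 0, 1, 0},
    };
    int color[N];
    for (int i = 0; i < N; i++) {
        int used[K] = {0};
        for (int j = 0; j < i; j++)   /* mark colors taken by colored neighbors */
            if (interfere[i][j] && color[j] >= 0)
                used[color[j]] = 1;
        color[i] = -1;
        for (int c = 0; c < K; c++)
            if (!used[c]) { color[i] = c; break; }
        if (color[i] < 0)
            printf("VR%d spilled to memory\n", i);
        else
            printf("VR%d -> physical reg %d\n", i, color[i]);
    }
    return 0;
}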

119 High-Level Optimizations
High-level optimizations are usually hardware independent Procedure inlining Loop unrolling Loop interchange, blocking (more on this later when we study cache/memory organization)

120 Low-Level Optimizations
Common sub-expression elimination
Constant propagation
Copy propagation
Dead store/code elimination
Code motion
Induction variable elimination
Strength reduction
Pipeline scheduling
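A before/after sketch in C of two of these transformations (the function and variable names are made up): common sub-expression elimination plus strength reduction:

#include <stdio.h>

/* before: i*4 is computed twice, using a multiply */
int sum_pair_before(int *a, int i) {
    return a[i*4] + a[i*4 + 1];
}

/* after: i*4 is computed once (common sub-expression elimination)
   and the multiply becomes a shift (strength reduction: i*4 == i<<2) */
int sum_pair_after(int *a, int i) {
    int t = i << 2;
    return a[t] + a[t + 1];
}

int main(void) {
    int a[16] = {0};
    a[4] = 2; a[5] = 3;
    printf("%d %d\n", sum_pair_before(a, 1), sum_pair_after(a, 1));   /* 5 5 */
    return 0;
}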

121 Lecture 7: Computer Arithmetic
Chapter 3
Chapter 2 wrap-up
Numerical representations
Addition and subtraction

122 Compilation Steps The front-end: deals mostly with language specific actions Scanning: reads characters and breaks them into tokens Parsing: checks syntax Semantic analysis: makes sure operations/types are meaningful Intermediate representation: simple instructions, infinite registers, makes few assumptions about hw The back-end: optimizations and code generation Local optimizations: within a basic block Global optimizations: across basic blocks Register allocation

123 Dataflow Control flow graph: each box represents a basic block and
arcs represent potential jumps between instructions For each block, the compiler computes values that were defined (written to) and used (read from) Such dataflow analysis is key to several optimizations: for example, moving code around, eliminating dead code, removing redundant computations, etc.

124 Register Allocation The IR contains infinite virtual registers – these must be mapped to the architecture’s finite set of registers (say, 32 registers) For each virtual register, its live range is computed (the range between which the register is defined and used) We must now assign one of 32 colors to each virtual register so that intersecting live ranges are colored differently – can be mapped to the famous graph coloring problem If this is not possible, some values will have to be temporarily spilled to memory and restored (this is equivalent to breaking a single live range into smaller live ranges)

125 Graph Coloring
(figure: an interference graph over virtual registers VR1-VR4, shown with two possible colorings/register assignments)

126 High-Level Optimizations
High-level optimizations are usually hardware independent Procedure inlining Loop unrolling Loop interchange, blocking (more on this later when we study cache/memory organization)

127 Low-Level Optimizations
Common sub-expression elimination
Constant propagation
Copy propagation
Dead store/code elimination
Code motion
Induction variable elimination
Strength reduction
Pipeline scheduling

128 Saves on Stack Caller saved
$a0-$a3 -- old arguments must be saved before setting new arguments for the callee
$ra -- must be saved before the jal instruction over-writes this value
$t0-$t9 -- if you plan to use your temps after the return, save them; note that callees are free to use temps as they please
You need not save $s0-$s7, as the callee will take care of them
Callee saved:
$s0-$s7 -- before the callee uses such a register, it must save the old contents, since the caller will usually need them on return
Local variables -- space is also created on the stack for variables local to that procedure

129 Binary Representation
A binary number x31 x30 … x1 x0 represents the quantity x31 x 2^31 + x30 x 2^30 + … + x1 x 2^1 + x0 x 2^0
A 32-bit word can represent 2^32 numbers between 0 and 2^32 - 1 … this is known as the unsigned representation, as we're assuming that numbers are always positive
(figure: a 32-bit word with the most significant bit on the left and the least significant bit on the right)

130 ASCII Vs. Binary Does it make more sense to represent a decimal number
in ASCII? Hardware to implement arithmetic would be difficult What are the storage needs? How many bits does it take to represent the decimal number 1,000,000,000 in ASCII and in binary?

131 ASCII Vs. Binary Does it make more sense to represent a decimal number
in ASCII? Hardware to implement arithmetic would be difficult
What are the storage needs? How many bits does it take to represent the decimal number 1,000,000,000 in ASCII and in binary?
In binary: 30 bits (2^30 > 1 billion)
In ASCII: 10 characters, 8 bits per char = 80 bits

132 Negative Numbers
32 bits can only represent 2^32 numbers – if we wish to also represent negative numbers, we can represent 2^31 positive numbers (incl zero) and 2^31 negative numbers

0000 0000 … 0000two = 0ten
0000 0000 … 0001two = 1ten
…
0111 1111 … 1111two = (2^31 - 1)ten
1000 0000 … 0000two = -(2^31)ten
1000 0000 … 0001two = -(2^31 - 1)ten
1000 0000 … 0010two = -(2^31 - 2)ten
…
1111 1111 … 1110two = -2ten
1111 1111 … 1111two = -1ten

133 2’s Complement Why is this representation favorable?
0000 0000 … 0000two = 0ten
0000 0000 … 0001two = 1ten
…
0111 1111 … 1111two = (2^31 - 1)ten
1000 0000 … 0000two = -(2^31)ten
1000 0000 … 0001two = -(2^31 - 1)ten
1000 0000 … 0010two = -(2^31 - 2)ten
…
1111 1111 … 1110two = -2ten
1111 1111 … 1111two = -1ten

Consider the sum of 1 and -2 …. we get -1
Consider the sum of 2 and -1 …. we get +1
This format can directly undergo addition without any conversions!
Each number x31 x30 … x1 x0 represents the quantity x31 x (-2^31) + x30 x 2^30 + … + x1 x 2^1 + x0 x 2^0

134 2's Complement
Note that the sum of a number x and its inverted representation x' always equals a string of 1s (-1):
x + x' = -1
x' + 1 = -x … hence, we can compute the negative of a number by inverting all bits and adding 1
Similarly, the sum of x and -x gives us all zeroes, with a carry of 1
In reality, x + (-x) = 2^n … hence the name 2's complement
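A quick C check of the identity -x = x' + 1 (a sketch; it relies on 2's complement arithmetic, which all mainstream hardware uses):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    int32_t x = 5;
    int32_t negated = ~x + 1;    /* invert all bits, then add 1 */
    printf("x = %d, ~x + 1 = %d, -x = %d\n", x, negated, -x);
    printf("x + ~x = %d (a string of 1s, i.e., -1)\n", x + ~x);
    return 0;
}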

135 Example Compute the 32-bit 2’s complement representations
for the following decimal numbers: 5, -5, -6

136 Example Compute the 32-bit 2’s complement representations
for the following decimal numbers: 5, -5, -6

 5: 0000 0000 0000 0000 0000 0000 0000 0101
-5: 1111 1111 1111 1111 1111 1111 1111 1011
-6: 1111 1111 1111 1111 1111 1111 1111 1010

Given -5, verify that negating and adding 1 yields the number 5

137 Signed / Unsigned The hardware recognizes two formats:
unsigned (corresponding to the C declaration unsigned int) -- all numbers are positive, a 1 in the most significant bit just means it is a really large number signed (C declaration is signed int or just int) -- numbers can be +/- , a 1 in the MSB means the number is negative This distinction enables us to represent twice as many numbers when we’re sure that we don’t need negatives

138 MIPS Instructions Consider a comparison instruction:
slt $t0, $t1, $zero, where $t1 contains a 32-bit number whose most significant bit is 1 (e.g., 1111 … 1101two)
What gets stored in $t0?

139 MIPS Instructions Consider a comparison instruction:
slt $t0, $t1, $zero, where $t1 contains a 32-bit number whose most significant bit is 1 (e.g., 1111 … 1101two)
What gets stored in $t0?
The result depends on whether $t1 is a signed or unsigned number – the compiler/programmer must track this and accordingly use either slt or sltu:
slt $t0, $t1, $zero stores 1 in $t0 (the value is a negative signed number)
sltu $t0, $t1, $zero stores 0 in $t0 (the value is a very large unsigned number)

140 The Bounds Check Shortcut
Suppose we want to check if 0 <= x < y and x and y are signed numbers (stored in $a1 and $t2) The following single comparison can check both conditions sltu $t0, $a1, $t2 beq $t0, $zero, EitherConditionFails We know that $t2 begins with a 0 If $a1 begins with a 0, sltu is effectively checking the second condition If $a1 begins with a 1, we want the condition to fail and coincidentally, sltu is guaranteed to fail in this case
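The same shortcut is a common C idiom (a sketch; as on the slide, it assumes y is non-negative): casting the signed index to unsigned folds both tests into one compare, because a negative x becomes a huge unsigned value:

#include <stdio.h>

/* returns 1 if 0 <= x < y, using a single unsigned comparison */
static int in_bounds(int x, int y) {
    return (unsigned int)x < (unsigned int)y;
}

int main(void) {
    printf("%d\n", in_bounds(3, 10));    /* 1: within bounds */
    printf("%d\n", in_bounds(-1, 10));   /* 0: (unsigned)-1 is huge */
    printf("%d\n", in_bounds(12, 10));   /* 0: too large */
    return 0;
}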

141 Sign Extension Occasionally, 16-bit signed numbers must be converted
into 32-bit signed numbers – for example, when doing an add with an immediate operand
The conversion is simple: take the most significant bit and use it to fill up the additional bits on the left – known as sign extension
So 2ten goes from 0000 0000 0000 0010 to 0000 0000 0000 0000 0000 0000 0000 0010, and -2ten goes from 1111 1111 1111 1110 to 1111 1111 1111 1111 1111 1111 1111 1110
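C performs the same sign extension automatically when a narrow signed type is widened (a sketch):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    int16_t small = -2;          /* 16-bit pattern 0xFFFE */
    int32_t wide = small;        /* sign-extended to 0xFFFFFFFE */
    uint16_t usmall = 0xFFFE;
    uint32_t uwide = usmall;     /* zero-extended to 0x0000FFFE */
    printf("signed:   0x%08x (%d)\n", (uint32_t)wide, wide);
    printf("unsigned: 0x%08x (%u)\n", uwide, uwide);
    return 0;
}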

142 Alternative Representations
The following two (intuitive) representations were discarded because they required additional conversion steps before arithmetic could be performed on the numbers sign-and-magnitude: the most significant bit represents +/- and the remaining bits express the magnitude one’s complement: -x is represented by inverting all the bits of x Both representations above suffer from two zeroes

143 Addition and Subtraction
Addition is similar to decimal arithmetic
For subtraction, simply add the negative number – hence, subtracting A - B involves negating B's bits, adding 1 (which yields -B), and adding the result to A

144 Overflows
For an unsigned number, overflow happens when the last carry (1) cannot be accommodated
For a signed number, overflow happens when the most significant bit is not the same as every bit to its left:
when the sum of two positive numbers is a negative result
when the sum of two negative numbers is a positive result
The sum of a positive and a negative number will never overflow
MIPS allows addu and subu instructions that work with unsigned integers and never flag an overflow – to detect the overflow, other instructions will have to be executed
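Since signed overflow is undefined behavior in C, a portable check tests before adding (a sketch mirroring the extra instructions a MIPS compiler would emit to detect overflow):

#include <stdio.h>
#include <limits.h>

/* returns 1 if a + b would overflow a signed int */
static int add_overflows(int a, int b) {
    if (b > 0 && a > INT_MAX - b) return 1;   /* two positives -> negative */
    if (b < 0 && a < INT_MIN - b) return 1;   /* two negatives -> positive */
    return 0;                                 /* mixed signs never overflow */
}

int main(void) {
    printf("%d\n", add_overflows(INT_MAX, 1));   /* 1 */
    printf("%d\n", add_overflows(-5, 3));        /* 0 */
    return 0;
}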

145 Lecture 8: Binary Multiplication & Division
Today’s topics: Addition/Subtraction Multiplication Division Reminder: get started early on assignment 3

146 2’s Complement – Signed Numbers
0000 0000 … 0000two = 0ten
0000 0000 … 0001two = 1ten
…
0111 1111 … 1111two = (2^31 - 1)ten
1000 0000 … 0000two = -(2^31)ten
1000 0000 … 0001two = -(2^31 - 1)ten
1000 0000 … 0010two = -(2^31 - 2)ten
…
1111 1111 … 1110two = -2ten
1111 1111 … 1111two = -1ten

Why is this representation favorable?
Consider the sum of 1 and -2 …. we get -1
Consider the sum of 2 and -1 …. we get +1
This format can directly undergo addition without any conversions!
Each number x31 x30 … x1 x0 represents the quantity x31 x (-2^31) + x30 x 2^30 + … + x1 x 2^1 + x0 x 2^0

147 Alternative Representations
The following two (intuitive) representations were discarded because they required additional conversion steps before arithmetic could be performed on the numbers sign-and-magnitude: the most significant bit represents +/- and the remaining bits express the magnitude one’s complement: -x is represented by inverting all the bits of x Both representations above suffer from two zeroes

148 Addition and Subtraction
Addition is similar to decimal arithmetic
For subtraction, simply add the negative number – hence, subtracting A - B involves negating B's bits, adding 1 (which yields -B), and adding the result to A

149 Overflows
For an unsigned number, overflow happens when the last carry (1) cannot be accommodated
For a signed number, overflow happens when the most significant bit is not the same as every bit to its left:
when the sum of two positive numbers is a negative result
when the sum of two negative numbers is a positive result
The sum of a positive and a negative number will never overflow
MIPS allows addu and subu instructions that work with unsigned integers and never flag an overflow – to detect the overflow, other instructions will have to be executed

150 Multiplication Example
Multiplicand     1000ten
Multiplier     x 1001ten
                 1000
                0000
               0000
              1000
Product       1001000ten

In every step:
the multiplicand is shifted
the next bit of the multiplier is examined (also a shifting step)
if this bit is 1, the shifted multiplicand is added to the product

151 HW Algorithm 1 In every step multiplicand is shifted
next bit of multiplier is examined (also a shifting step) if this bit is 1, shifted multiplicand is added to the product

152 HW Algorithm 2 32-bit ALU and multiplicand is untouched
the sum keeps shifting right at every step, number of bits in product + multiplier = 64, hence, they share a single 64-bit register
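A C sketch of this shift-and-add loop (unsigned operands; the 64-bit product variable plays the role of the shared product/multiplier register):

#include <stdio.h>
#include <stdint.h>

/* shift-and-add multiply of two unsigned 32-bit values */
static uint64_t multiply(uint32_t multiplicand, uint32_t multiplier) {
    uint64_t product = multiplier;    /* the low half initially holds the multiplier */
    for (int step = 0; step < 32; step++) {
        if (product & 1)              /* examine the next multiplier bit */
            product += (uint64_t)multiplicand << 32;   /* add into the upper half */
        product >>= 1;                /* shift product/multiplier right */
    }
    return product;
}

int main(void) {
    printf("%llu\n", (unsigned long long)multiply(8, 9));   /* 72 */
    return 0;
}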

153 Notes The previous algorithm also works for signed numbers
(negative numbers in 2’s complement form) We can also convert negative numbers to positive, multiply the magnitudes, and convert to negative if signs disagree The product of two 32-bit numbers can be a 64-bit number -- hence, in MIPS, the product is saved in two 32-bit registers

154 MIPS Instructions mult $s2, $s3 computes the product and stores
it in two "internal" registers that can be referred to as hi and lo
mfhi $s0 moves the value in hi into $s0
mflo $s1 moves the value in lo into $s1
Similarly for multu

155 Fast Algorithm The previous algorithm requires a clock to ensure that
the earlier addition has completed before shifting This algorithm can quickly set up most inputs – it then has to wait for the result of each add to propagate down – faster because no clock is involved -- Note: high transistor cost

156 Division

                  1001ten       Quotient
Divisor 1000ten | 1001010ten    Dividend
                  -1000
                     10
                     101
                     1010
                    -1000
                       10ten    Remainder

At every step:
shift the divisor right and compare it with the current dividend
if the divisor is larger, shift 0 as the next bit of the quotient
if the divisor is smaller, subtract to get the new dividend and shift 1 as the next bit of the quotient

157 Division

                  1001ten       Quotient
Divisor 1000ten | 1001010ten    Dividend
                  -1000
                     10
                     101
                     1010
                    -1000
                       10ten    Remainder

At every step:
shift the divisor right and compare it with the current dividend
if the divisor is larger, shift 0 as the next bit of the quotient
if the divisor is smaller, subtract to get the new dividend and shift 1 as the next bit of the quotient

158 Divide Example
Divide 7ten (0000 0111two) by 2ten (0010two)

Iter   Step             Quot   Divisor   Remainder
0      Initial values
1
2
3
4
5

159 Divide Example
Divide 7ten (0000 0111two) by 2ten (0010two)

Iter  Step                               Quot   Divisor     Remainder
0     Initial values                     0000   0010 0000   0000 0111
1     Rem = Rem - Div                    0000   0010 0000   1110 0111
      Rem < 0 → +Div, shift 0 into Q     0000   0010 0000   0000 0111
      Shift Div right                    0000   0001 0000   0000 0111
2     Same steps as 1                    0000   0000 1000   0000 0111
3     Same steps as 1                    0000   0000 0100   0000 0111
4     Rem = Rem - Div                    0000   0000 0100   0000 0011
      Rem >= 0 → shift 1 into Q          0001   0000 0100   0000 0011
      Shift Div right                    0001   0000 0010   0000 0011
5     Same steps as 4                    0011   0000 0001   0000 0001

160 Hardware for Division A comparison requires a subtract; the sign of the result is examined; if the result is negative, the divisor must be added back
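A C sketch of this restoring-division loop for unsigned 32-bit operands (subtract, test the sign, and add the divisor back if the result went negative):

#include <stdio.h>
#include <stdint.h>

/* restoring division: returns the quotient and writes the remainder */
static uint32_t divide(uint32_t dividend, uint32_t divisor, uint32_t *rem) {
    uint64_t r = dividend;                   /* remainder register */
    uint64_t d = (uint64_t)divisor << 32;    /* divisor starts in the upper half */
    uint32_t q = 0;
    for (int step = 0; step < 33; step++) {
        r -= d;                              /* trial subtraction */
        if ((int64_t)r < 0) {
            r += d;                          /* negative: restore the remainder */
            q = q << 1;                      /* ... and shift 0 into the quotient */
        } else {
            q = (q << 1) | 1;                /* non-negative: shift 1 in */
        }
        d >>= 1;                             /* shift the divisor right */
    }
    *rem = (uint32_t)r;
    return q;
}

int main(void) {
    uint32_t rem, quo = divide(7, 2, &rem);
    printf("7 / 2 = %u rem %u\n", quo, rem);   /* 3 rem 1 */
    return 0;
}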

161 Efficient Division

162 Divisions involving Negatives
Simplest solution: convert to positive and adjust the sign later
Note that multiple solutions exist for the equation: Dividend = Quotient x Divisor + Remainder
+7 div +2    Quo =      Rem =
-7 div +2    Quo =      Rem =
+7 div -2    Quo =      Rem =
-7 div -2    Quo =      Rem =

163 Divisions involving Negatives
Simplest solution: convert to positive and adjust the sign later
Note that multiple solutions exist for the equation: Dividend = Quotient x Divisor + Remainder
+7 div +2    Quo = +3   Rem = +1
-7 div +2    Quo = -3   Rem = -1
+7 div -2    Quo = -3   Rem = +1
-7 div -2    Quo = +3   Rem = -1
Convention: the dividend and remainder have the same sign; the quotient is negative if the signs disagree
These rules fulfil the equation above

164 Lecture 9: Floating Point
Division FP arithmetic

165 Division

                  1001ten       Quotient
Divisor 1000ten | 1001010ten    Dividend
                  -1000
                     10
                     101
                     1010
                    -1000
                       10ten    Remainder

At every step:
shift the divisor right and compare it with the current dividend
if the divisor is larger, shift 0 as the next bit of the quotient
if the divisor is smaller, subtract to get the new dividend and shift 1 as the next bit of the quotient

166 Divide Example
Divide 7ten (0000 0111two) by 2ten (0010two)

Iter  Step                               Quot   Divisor     Remainder
0     Initial values                     0000   0010 0000   0000 0111
1     Rem = Rem - Div                    0000   0010 0000   1110 0111
      Rem < 0 → +Div, shift 0 into Q     0000   0010 0000   0000 0111
      Shift Div right                    0000   0001 0000   0000 0111
2     Same steps as 1                    0000   0000 1000   0000 0111
3     Same steps as 1                    0000   0000 0100   0000 0111
4     Rem = Rem - Div                    0000   0000 0100   0000 0011
      Rem >= 0 → shift 1 into Q          0001   0000 0100   0000 0011
      Shift Div right                    0001   0000 0010   0000 0011
5     Same steps as 4                    0011   0000 0001   0000 0001

167 Efficient Division

168 Divisions involving Negatives
Simplest solution: convert to positive and adjust the sign later
Note that multiple solutions exist for the equation: Dividend = Quotient x Divisor + Remainder
+7 div +2    Quo =      Rem =
-7 div +2    Quo =      Rem =
+7 div -2    Quo =      Rem =
-7 div -2    Quo =      Rem =

169 Divisions involving Negatives
Simplest solution: convert to positive and adjust the sign later
Note that multiple solutions exist for the equation: Dividend = Quotient x Divisor + Remainder
+7 div +2    Quo = +3   Rem = +1
-7 div +2    Quo = -3   Rem = -1
+7 div -2    Quo = -3   Rem = +1
-7 div -2    Quo = +3   Rem = -1
Convention: the dividend and remainder have the same sign; the quotient is negative if the signs disagree
These rules fulfil the equation above

170 Floating Point
Normalized scientific notation: a single non-zero digit to the left of the decimal (binary) point – example: 3.5 x 10^9
A binary example: 1.0…1two x 2^-5 = (1 + 0 x 2^-1 + … + 1 x 2^-6) x 2^-5
A standard notation enables easy exchange of data between machines and simplifies hardware algorithms – the IEEE 754 standard defines how floating point numbers are represented

171 Sign and Magnitude Representation
Sign (1 bit) | Exponent (8 bits) | Fraction (23 bits)
S | E | F
More exponent bits → a wider range of numbers (not necessarily more numbers – recall there are infinitely many real numbers); more fraction bits → higher precision
Register value = (-1)^S x F x 2^E
Since we are only representing normalized numbers, we are guaranteed that the number is of the form 1.xxxx… Hence, in the IEEE 754 standard, the 1 is implicit:
Register value = (-1)^S x (1 + F) x 2^E

172 Sign and Magnitude Representation
Sign (1 bit) | Exponent (8 bits) | Fraction (23 bits)
S | E | F
Largest number that can be represented:
Smallest number that can be represented:

173 Sign and Magnitude Representation
Sign (1 bit) | Exponent (8 bits) | Fraction (23 bits)
S | E | F
Largest number that can be represented: 2.0 x 2^128 = 2.0 x 10^38
Smallest number that can be represented: 2.0 x 2^-128 = 2.0 x 10^-38
Overflow: when representing a number larger than the one above; underflow: when representing a number smaller than the one above
Double precision format occupies two 32-bit registers:
Sign (1 bit) | Exponent (11 bits) | Fraction (52 bits)
Largest: 2.0 x 2^1024 = 2.0 x 10^308
Smallest: 2.0 x 2^-1024 = 2.0 x 10^-308

174 Details The number “0” has a special code so that the implicit 1 does not get added: the code is all 0s (it may seem that this takes up the representation for 1.0, but given how the exponent is represented, we’ll soon see that that’s not the case) The largest exponent value (with zero fraction) represents +/- infinity The largest exponent value (with non-zero fraction) represents NaN (not a number) – for the result of 0/0 or (infinity minus infinity)

175 Exponent Representation
To simplify sorting, the sign was placed as the first bit
For a similar reason, the representation of the exponent is also modified: in order to use integer compares, it would be preferable to have the smallest exponent as 00…0 and the largest exponent as 11…1
This is the biased notation, where a bias is subtracted from the exponent field to yield the true exponent
IEEE 754 single-precision uses a bias of 127 (since the exponent must have values between -127 and 128)… double precision uses a bias of 1023
Final representation: (-1)^S x (1 + Fraction) x 2^(Exponent - Bias)
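A C sketch that pulls apart the three fields of a single-precision float using the bias of 127 (memcpy is used to reinterpret the bits safely):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    float f = -5.0f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);            /* reinterpret the 32 bits */
    uint32_t sign = bits >> 31;
    uint32_t exponent = (bits >> 23) & 0xFF;   /* biased exponent */
    uint32_t fraction = bits & 0x7FFFFF;       /* 23 fraction bits */
    printf("sign=%u biased-exp=%u true-exp=%d fraction=0x%06x\n",
           sign, exponent, (int)exponent - 127, fraction);
    /* -5.0 = (-1)^1 x (1 + 0.25) x 2^2: sign 1, biased exp 129, fraction 0x200000 */
    return 0;
}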

176 Examples
Final representation: (-1)^S x (1 + Fraction) x 2^(Exponent - Bias)
Represent (-0.75)ten in single and double-precision formats
Single: ( )
Double: ( )
What decimal number is represented by the following single-precision number?
1 1000 0001 0100 0000 0000 0000 0000 000

177 Examples
Final representation: (-1)^S x (1 + Fraction) x 2^(Exponent - Bias)
Represent (-0.75)ten in single and double-precision formats
Single: (-1)^1 x (1 + 0.5) x 2^(126 - 127) → 1 0111 1110 1000 0000 0000 0000 0000 000
Double: (-1)^1 x (1 + 0.5) x 2^(1022 - 1023) → 1 011 1111 1110 1000 … 000
What decimal number is represented by the following single-precision number?
1 1000 0001 0100 0000 0000 0000 0000 000
-5.0

178 FP Addition Consider the following decimal example (can maintain
only 4 decimal digits and 2 exponent digits)
9.999 x 10^1 + 1.610 x 10^-1
Convert to the larger exponent: 9.999 x 10^1 + 0.016 x 10^1
Add: 10.015 x 10^1
Normalize: 1.0015 x 10^2
Check for overflow/underflow
Round: 1.002 x 10^2
Re-normalize (if needed)

179 FP Addition Consider the following decimal example (can maintain
only 4 decimal digits and 2 exponent digits)
9.999 x 10^1 + 1.610 x 10^-1
Convert to the larger exponent: 9.999 x 10^1 + 0.016 x 10^1
Add: 10.015 x 10^1
Normalize: 1.0015 x 10^2
Check for overflow/underflow
Round: 1.002 x 10^2
Re-normalize (if needed)
If we had more fraction bits, these errors would be minimized

180 FP Multiplication Similar steps: Compute exponent (careful!)
Multiply significands (set the binary point correctly) Normalize Round (potentially re-normalize) Assign sign

181 MIPS Instructions The usual add.s, add.d, sub, mul, div
Comparison instructions: c.eq.s, c.neq.s, c.lt.s…. These comparisons set an internal bit in hardware that is then inspected by branch instructions: bc1t, bc1f Separate register file $f0 - $f31 : a double-precision value is stored in (say) $f4-$f5 and is referred to by $f4 Load/store instructions (lwc1, swc1) must still use integer registers for address computation

182 Code Example float f2c (float fahr) {
  return ((5.0/9.0) * (fahr - 32.0));
}
(argument fahr is stored in $f12)

lwc1 $f16, const5($gp)
lwc1 $f18, const9($gp)
div.s $f16, $f16, $f18
lwc1 $f18, const32($gp)
sub.s $f18, $f12, $f18
mul.s $f0, $f16, $f18
jr $ra

183 Lecture 10: FP, Performance Metrics
Chapter 4
FP arithmetic
Evaluating a system

191 Performance Metrics Possible measures:
response time – time elapsed between start and end of a program throughput – amount of work done in a fixed time The two measures are usually linked A faster processor will improve both More processors will likely only improve throughput What influences performance?

192 Execution Time Consider a system X executing a fixed workload W
PerformanceX = 1 / Execution timeX Execution time = response time = wall clock time - Note that this includes time to execute the workload as well as time spent by the operating system co-ordinating various events The UNIX “time” command breaks up the wall clock time as user and system time

193 Speedup and Improvement
System X executes a program in 10 seconds, system Y executes the same program in 15 seconds System X is 1.5 times faster than system Y The speedup of system X over system Y is 15/10 = 1.5 (the ratio) The performance improvement of X over Y is 1.5 – 1 = 0.5 = 50% The execution time reduction for the program, compared to Y is (15-10) / 15 = 33% The execution time increase, compared to X is (15-10) / 10 = 50%

194 Performance Equation - I
CPU execution time = CPU clock cycles x Clock cycle time Clock cycle time = 1 / Clock speed If a processor has a frequency of 3 GHz, the clock ticks 3 billion times in a second – as we’ll soon see, with each clock tick, one or more (or fewer) instructions may complete If a program runs for 10 seconds on a 3 GHz processor, how many clock cycles did it run for? If a program runs for 2 billion clock cycles on a 1.5 GHz processor, what is the execution time in seconds?
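Worked answers (simple arithmetic, not spelled out on the slide): 10 seconds at 3 GHz is 10 x 3 x 10^9 = 3 x 10^10 clock cycles; 2 billion cycles at 1.5 GHz take 2 x 10^9 / 1.5 x 10^9 ≈ 1.33 seconds.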

195 Performance Equation - II
CPU clock cycles = number of instrs x avg clock cycles per instruction (CPI) Substituting in previous equation, Execution time = clock cycle time x number of instrs x avg CPI If a 2 GHz processor graduates an instruction every third cycle, how many instructions are there in a program that runs for 10 seconds?
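Worked answer (not spelled out on the slide): at 2 GHz with one instruction completing every third cycle (CPI = 3), the program contains (2 x 10^9 cycles/sec x 10 sec) / 3 ≈ 6.67 billion instructions.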

196 Factors Influencing Performance
Execution time = clock cycle time x number of instrs x avg CPI Clock cycle time: manufacturing process (how fast is each transistor), how much work gets done in each pipeline stage (more on this later) Number of instrs: the quality of the compiler and the instruction set architecture CPI: the nature of each instruction and the quality of the architecture implementation

197 Example Execution time = clock cycle time x number of instrs x avg CPI
Which of the following two systems is better? A program is converted into 4 billion MIPS instructions by a compiler ; the MIPS processor is implemented such that each instruction completes in an average of 1.5 cycles and the clock speed is 1 GHz The same program is converted into 2 billion x86 instructions; the x86 processor is implemented such that each instruction completes in an average of 6 cycles and the clock speed is 1.5 GHz
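Plugging the slide’s numbers into a small C sketch (the helper function is illustrative, not part of the lecture):

  #include <stdio.h>

  /* Execution time = number of instrs x avg CPI x clock cycle time */
  static double exec_time(double instrs, double cpi, double clock_hz) {
      return instrs * cpi / clock_hz;
  }

  int main(void) {
      double mips = exec_time(4e9, 1.5, 1.0e9);  /* 6.0 seconds */
      double x86  = exec_time(2e9, 6.0, 1.5e9);  /* 8.0 seconds */
      printf("MIPS: %.1f s  x86: %.1f s\n", mips, x86);
      return 0;
  }

The MIPS system finishes in 6 seconds versus 8 seconds for the x86 system, so it is the better one here.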

198 Benchmark Suites Measuring performance components is difficult for most users: average CPI requires simulation/hardware counters, instruction count requires profiling tools/hardware counters, OS interference is hard to quantify, etc. Each vendor announces a SPEC rating for their system a measure of execution time for a fixed collection of programs is a function of a specific CPU, memory system, IO system, operating system, compiler enables easy comparison of different systems The key is coming up with a collection of relevant programs

199 SPEC CPU SPEC: Standard Performance Evaluation Corporation, an industry
consortium that creates a collection of relevant programs The 2006 version includes 12 integer and 17 floating-point applications The SPEC rating specifies how much faster a system is, compared to a baseline machine – a system with SPEC rating 600 is 1.5 times faster than a system with SPEC rating 400 Note that this rating incorporates the behavior of all 29 programs – this may not necessarily predict performance for your favorite program!

200 Deriving a Single Performance Number
How is the performance of 29 different apps compressed into a single performance number? SPEC uses geometric mean (GM) – the execution time of each program is multiplied and the Nth root is derived Another popular metric is arithmetic mean (AM) – the average of each program’s execution time Weighted arithmetic mean – the execution times of some programs are weighted to balance priorities
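A minimal C sketch of the two means (illustrative; in practice SPEC applies the geometric mean to performance ratios against the reference machine):

  #include <math.h>

  /* geometric mean: multiply the N values and take the Nth root
     (computed via logarithms to avoid overflow) */
  double geometric_mean(const double *t, int n) {
      double log_sum = 0.0;
      for (int i = 0; i < n; i++)
          log_sum += log(t[i]);
      return exp(log_sum / n);
  }

  /* arithmetic mean: the plain average of the N values */
  double arithmetic_mean(const double *t, int n) {
      double sum = 0.0;
      for (int i = 0; i < n; i++)
          sum += t[i];
      return sum / n;
  }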

201 Amdahl’s Law Architecture design is very bottleneck-driven – make the
common case fast, do not waste resources on a component that has little impact on overall performance/power Amdahl’s Law: the performance improvement from an enhancement is limited by the fraction of time the enhancement comes into play Example: a web server spends 40% of time in the CPU and 60% of time doing I/O – a new processor that is ten times faster results in a 36% reduction in execution time (speedup of 1.56) – Amdahl’s Law states that maximum execution time reduction is 40% (max speedup of 1.66)
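The arithmetic behind the example: if a fraction f of the time is sped up by a factor s, overall speedup = 1 / ((1 - f) + f/s). With f = 0.4 and s = 10, speedup = 1 / (0.6 + 0.04) ≈ 1.56, a 36% time reduction; letting s grow without bound gives the ceiling 1 / 0.6 ≈ 1.66, a 40% reduction.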

202 Lecture 11: Digital Design
Evaluating a system Intro to boolean functions

208 Digital Design Basics Two voltage levels – high and low (1 and 0, true and false) Hence, the use of binary arithmetic/logic in all computers A transistor is a 3-terminal device that acts as a switch [figure: transistor switch – a high gate voltage makes it conducting, a low gate voltage non-conducting]

209 Logic Blocks A logic block has a number of binary inputs and produces
a number of binary outputs – the simplest logic block is composed of a few transistors A logic block is termed combinational if the output is only a function of the inputs A logic block is termed sequential if the block has some internal memory (state) that also influences the output A basic logic block is termed a gate (AND, OR, NOT, etc.) We will only deal with combinational circuits today

210 Truth Table A truth table defines the outputs of a logic block for each set of inputs Consider a block with 3 inputs A, B, C and an output E that is true only if exactly 2 inputs are true
A B C | E
0 0 0 | 0
0 0 1 | 0
0 1 0 | 0
0 1 1 | 1
1 0 0 | 0
1 0 1 | 1
1 1 0 | 1
1 1 1 | 0

211 Truth Table (same truth table as above) Can be compressed by only representing the cases that have an output of 1: the rows A B C = 0 1 1, 1 0 1, 1 1 0

212 Boolean Algebra Equations involving two values and three primary operators:
OR : symbol + , X = A + B → X is true if at least one of A or B is true
AND : symbol . , X = A . B → X is true if both A and B are true
NOT : symbol is an overbar (written here as a prime) , X = A' → X is the inverted value of A

213 Boolean Algebra Rules Identity law : A + 0 = A ; A . 1 = A
Zero and One laws : A + 1 = 1 ; A . 0 = 0 Inverse laws : A . A' = 0 ; A + A' = 1 Commutative laws : A + B = B + A ; A . B = B . A Associative laws : A + (B + C) = (A + B) + C A . (B . C) = (A . B) . C Distributive laws : A . (B + C) = (A . B) + (A . C) A + (B . C) = (A + B) . (A + C)

214 DeMorgan’s Laws (A + B)' = A' . B' and (A . B)' = A' + B' (primes denote NOT)
Confirm that these are indeed true
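A brute-force check in C (illustrative), walking all four input combinations:

  #include <assert.h>

  int main(void) {
      for (int a = 0; a <= 1; a++)
          for (int b = 0; b <= 1; b++) {
              assert(!(a | b) == (!a & !b));  /* (A + B)' = A' . B' */
              assert(!(a & b) == (!a | !b));  /* (A . B)' = A' + B' */
          }
      return 0;  /* reaching here means both laws hold */
  }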

215 Pictorial Representations
[figure: standard gate symbols for AND, OR, NOT] What logic function is this?

216 Boolean Equation Consider the logic block that has an output E that is true only if exactly two of the three inputs A, B, C are true

217 Boolean Equation Consider the logic block that has an output E that is true only if exactly two of the three inputs A, B, C are true Multiple correct equations: Two must be true, but all three cannot be true: E = ((A . B) + (B . C) + (A . C)) . (A . B . C)' Identify the three cases where it is true: E = (A . B . C') + (A . C . B') + (C . B . A')

218 Sum of Products Can represent any logic block with the AND, OR, NOT operators Draw the truth table For each true output, represent the corresponding inputs as a product The final equation is a sum of these products For the truth table above: E = (A . B . C') + (A . C . B') + (C . B . A') Can also use “product of sums” Any equation can be implemented with an array of ANDs, followed by an array of ORs

219 NAND and NOR NAND : NOT of AND : A nand B = (A . B)'
NOR : NOT of OR : A nor B = (A + B)' NAND and NOR are universal gates, i.e., they can be used to construct any complex logical function

220 Common Logic Blocks – Decoder
Takes in N inputs and activates one of 2^N outputs [figure: 3-to-8 decoder – inputs I0-I2 select exactly one of outputs O0-O7]
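The behavior fits in one line of C (an illustrative sketch):

  /* 3-to-8 decoder: a 3-bit input sets exactly one of 8 output bits */
  unsigned decode3to8(unsigned input) {
      return 1u << (input & 7);
  }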

221 Common Logic Blocks – Multiplexor
Multiplexor or selector: one of N inputs is reflected on the output depending on the value of the log2(N) selector bits [figure: 2-input mux]
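And the corresponding C sketch for a 2-input mux (illustrative):

  /* 2-input mux: one selector bit chooses between inputs a and b */
  unsigned mux2(unsigned a, unsigned b, unsigned sel) {
      return sel ? b : a;
  }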

222 Lecture 12: Hardware for Arithmetic
Designing an ALU Carry-lookahead adder

225 Adder Algorithm Example: 1 0 0 1 + 0 1 0 1 → Sum 1 1 1 0, Carry 0 0 0 1
Truth Table for the above operations:
A B Cin | Sum Cout
0 0 0 | 0 0
0 0 1 | 1 0
0 1 0 | 1 0
0 1 1 | 0 1
1 0 0 | 1 0
1 0 1 | 0 1
1 1 0 | 0 1
1 1 1 | 1 1

226 Adder Algorithm Example: 1 0 0 1 + 0 1 0 1 → Sum 1 1 1 0, Carry 0 0 0 1
Equations (primes denote NOT):
Sum = Cin . A' . B' + B . Cin' . A' + A . Cin' . B' + A . B . Cin
Cout = A . B . Cin' + A . B' . Cin + B . Cin . A' + A . B . Cin = A . B + A . Cin + B . Cin
Truth Table for the above operations: (same full-adder table as on the previous slide)

227 Carry Out Logic Equations (primes denote NOT):
Sum = Cin . A' . B' + B . Cin' . A' + A . Cin' . B' + A . B . Cin
Cout = A . B . Cin' + A . B' . Cin + B . Cin . A' + A . B . Cin = A . B + A . Cin + B . Cin
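These equations reduce to a few lines of C (an illustrative sketch):

  /* 1-bit full adder: a, b, cin are each 0 or 1 */
  void full_adder(int a, int b, int cin, int *sum, int *cout) {
      *sum  = a ^ b ^ cin;                     /* 1 when an odd number of inputs are 1 */
      *cout = (a & b) | (a & cin) | (b & cin); /* 1 when at least two inputs are 1 */
  }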

228 1-Bit ALU with Add, Or, And Multiplexor selects between Add, Or, And operations

229 32-bit Ripple Carry Adder
1-bit ALUs are connected “in series” with the carry-out of 1 box going into the carry-in of the next box

230 Incorporating Subtraction
Must invert the bits of B and add a 1 (two's complement: a – b = a + b' + 1) Include an inverter on the B input The CarryIn for the first bit is 1 This CarryIn signal (for the first bit) can be the same as the Binvert signal
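A tiny C check of the identity the hardware exploits (illustrative):

  #include <stdio.h>
  #include <stdint.h>

  int main(void) {
      uint32_t a = 29, b = 13;
      uint32_t diff = a + ~b + 1;  /* invert the bits of b, add 1: same as a - b */
      printf("%u\n", diff);        /* prints 16 */
      return 0;
  }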

231 Incorporating NOR

232 Incorporating slt Perform a – b and check the sign
New signal (Less) that is zero for ALU boxes 1-31 The 31st box has a unit to detect overflow and sign – the sign bit serves as the Less signal for the 0th box

233 Incorporating beq Perform a – b and confirm that the
result is all zeros

234 Control Lines What are the values of the control lines
and what operations do they correspond to?

235 Control Lines What are the values of the control lines
and what operations do they correspond to?
Ainvert Bnegate Operation | Function
0 0 00 | AND
0 0 01 | OR
0 0 10 | Add
0 1 10 | Sub
0 1 11 | SLT
1 1 00 | NOR

236 Speed of Ripple Carry The carry propagates thru every 1-bit box: each 1-bit box sequentially implements AND and OR – total delay is the time to go through 64 gates! We’ve already seen that any logic equation can be expressed as the sum of products – so it should be possible to compute the result by going through only 2 gates! Caveat: need many parallel gates and each gate may have a very large number of inputs – it is difficult to efficiently build such large gates, so we’ll find a compromise: moderate number of gates moderate number of inputs to each gate moderate number of sequential gates traversed

237 Computing CarryOut CarryIn1 = b0.CarryIn0 + a0.CarryIn0 + a0.b0
CarryIn2 = b1.b0.c0 + b1.a0.c0 + b1.a0.b0 + a1.b0.c0 + a1.a0.c0 + a1.a0.b0 + a1.b1 CarryIn32 = a really large sum of really large products Potentially fast implementation as the result is computed by going thru just 2 levels of logic – unfortunately, each gate is enormous and slow

238 Generate and Propagate
Equation re-phrased: Ci+1 = ai.bi + ai.Ci + bi.Ci = (ai.bi) + (ai + bi).Ci Stated verbally, the current pair of bits will generate a carry if they are both 1 and the current pair of bits will propagate a carry if either is 1 Generate signal = ai.bi Propagate signal = ai + bi Therefore, Ci+1 = Gi + Pi . Ci

239 Generate and Propagate
c1 = g0 + p0.c0 c2 = g1 + p1.c1 = g1 + p1.g0 + p1.p0.c0 c3 = g2 + p2.g1 + p2.p1.g0 + p2.p1.p0.c0 c4 = g3 + p3.g2 + p3.p2.g1 + p3.p2.p1.g0 + p3.p2.p1.p0.c0 Either, a carry was just generated, or a carry was generated in the last step and was propagated, or a carry was generated two steps back and was propagated by both the next two stages, or a carry was generated N steps back and was propagated by every single one of the N next stages
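A C sketch of the g/p recurrence for one 4-bit block (illustrative; it evaluates ci+1 = gi + pi.ci bit by bit rather than as the flattened two-level logic above):

  /* 4-bit carry computation: a and b are 4-bit values, c0 is the carry-in */
  unsigned cla4_carry_out(unsigned a, unsigned b, unsigned c0) {
      unsigned g[4], p[4], c[5];
      c[0] = c0 & 1;
      for (int i = 0; i < 4; i++) {
          g[i] = (a >> i) & (b >> i) & 1;    /* generate  = ai.bi   */
          p[i] = ((a >> i) | (b >> i)) & 1;  /* propagate = ai + bi */
      }
      for (int i = 0; i < 4; i++)
          c[i + 1] = g[i] | (p[i] & c[i]);   /* ci+1 = gi + pi.ci   */
      return c[4];                           /* carry out of the block */
  }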

240 Divide and Conquer The equations on the previous slide are still difficult to implement as logic functions – for the 32nd bit, we must AND every single propagate bit to determine what becomes of c0 (among other things) Hence, the bits are broken into groups (of 4) and each group computes its group-generate and group-propagate For example, to add 32 numbers, you can partition the task as a tree

241 P and G for 4-bit Blocks Compute P0 and G0 (super-propagate and super-generate) for the first group of 4 bits (and similarly for other groups of 4 bits) P0 = p0.p1.p2.p3 G0 = g3 + g2.p3 + g1.p2.p3 + g0.p1.p2.p3 Carry out of the first group of 4 bits is C1 = G0 + P0.c0 C2 = G1 + P1.G0 + P1.P0.c0 By having a tree of sub-computations, each AND, OR gate has few inputs and logic signals have to travel through a modest set of gates (equal to the height of the tree)

242 Example [figure: add A and B – per-bit g and p signals, block-level P and G signals, and the resulting carry out C4 = 1]

243 Carry Look-Ahead Adder
16-bit Ripple-carry takes 32 steps This design takes how many steps?

244 Lecture 13: Sequential Circuits
Carry-lookahead adder Clocks and sequential circuits Finite state machines

253 Clocks A microprocessor is composed of many different circuits
that are operating simultaneously – if each circuit X takes in inputs at time TIX, takes time TEX to execute the logic, and produces outputs at time TOX, imagine the complications in co-ordinating the tasks of every circuit A major school of thought (used in most processors built today): all circuits on the chip share a clock signal (a square wave) that tells every circuit when to accept inputs, how much time they have to execute the logic, and when they must produce outputs

254 Clock Terminology [figure: square wave showing the rising clock edge, falling clock edge, and cycle time]
4 GHz clock speed → cycle time = 1 / (4 GHz) = 250 ps

255 Sequential Circuits Until now, circuits were combinational – when inputs change, the outputs change after a while (time = logic delay thru circuit) [figure: inputs → combinational circuit → outputs] We want the clock to act like a start and stop signal – a “latch” is a storage device that stores its inputs at a rising clock edge and this storage will not change until the next rising clock edge [figure: latch → combinational circuit → latch, all driven by the clock]

256 Sequential Circuits Sequential circuit: consists
of combinational circuit and a storage element At the start of the clock cycle, the rising edge causes the “state” storage to store some input values This state will not change for an entire cycle (until next rising edge) The combinational circuit has some time to accept the value of “state” and “inputs” and produce “outputs” Some of the outputs (for example, the value of next “state”) may feed back (but through the latch, so they're only seen in the next cycle) [figure: state and inputs feed the combinational circuit, which produces outputs and the next state]

257 Designing a Latch An S-R latch: set-reset latch
When Set is high, a 1 is stored When Reset is high, a 0 is stored When both are low, the previous state is preserved (hence, known as a storage or memory element) When both are high, the output is unstable – this set of inputs is therefore not allowed Verify the above behavior!

258 D Latch Incorporates a clock
The value of the input D signal (data) is stored only when the clock is high – the previous state is preserved when the clock is low

259 D Flip Flop Terminology:
Latch: outputs can change any time the clock is high (asserted) Flip flop: outputs can change only on a clock edge Two D latches in series – ensures that a value is stored only on the falling edge of the clock

261 Finite State Machine A sequential circuit is described by a variation of a truth table – a finite state diagram (hence, the circuit is also called a finite state machine) Note that state is updated only on a clock edge [figure: inputs and the current state feed a next-state function and an output function; the next state is latched on the clock edge]

262 State Diagrams Each state is shown with a circle, labeled with the state value – the contents of the circle are the outputs An arc represents a transition to a different state, with the inputs indicated on the label [figure: two states, 0 and 1, with arcs labeled D = 0 and D = 1] This is a state diagram for ___?

263 3-Bit Counter Consider a circuit that stores a number and increments the value on every clock edge – on reaching the largest value, it starts again from 0 Draw the state diagram: How many states? How many inputs?

264 3-Bit Counter Consider a circuit that stores a number and increments the value on every clock edge – on reaching the largest value, it starts again from 0 Draw the state diagram: How many states? How many inputs? Eight states and no inputs (other than the clock): 000 → 001 → 010 → 011 → 100 → 101 → 110 → 111 → 000 → …
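A minimal C sketch of this FSM (illustrative; one loop iteration per simulated clock edge):

  #include <stdio.h>

  int main(void) {
      unsigned state = 0;              /* 3 bits of state */
      for (int edge = 0; edge < 10; edge++) {
          printf("%u%u%u\n", (state >> 2) & 1, (state >> 1) & 1, state & 1);
          state = (state + 1) & 7;     /* increment and wrap after 111 */
      }
      return 0;
  }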

265 Traffic Light Controller
Problem description: A traffic light with only green and red; either the North-South road has green or the East-West road has green (both can’t be red); there are detectors on the roads to indicate if a car is on the road; the lights are updated every 30 seconds; a light need change only if a car is waiting on the other road State Transition Table: How many states? How many inputs? How many outputs?

266 State Transition Table
Problem description: A traffic light with only green and red; either the North-South road has green or the East-West road has green (both can’t be red); there are detectors on the roads to indicate if a car is on the road; the lights are updated every 30 seconds; a light must change only if a car is waiting on the other road State Transition Table (inputs: 1 = a car is waiting):
CurrState InputEW InputNS | NextState = Output
N 0 0 | N
N 0 1 | N
N 1 0 | E
N 1 1 | E
E 0 0 | E
E 0 1 | N
E 1 0 | E
E 1 1 | N

267 State Diagram State Transition Table:
CurrState InputEW InputNS | NextState = Output
N 0 0 | N
N 0 1 | N
N 1 0 | E
N 1 1 | E
E 0 0 | E
E 0 1 | N
E 1 0 | E
E 1 1 | N
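The table collapses to a small C function (an illustrative sketch; 'N' means North-South has green, 'E' means East-West has green):

  /* one transition per 30-second update; carEW/carNS come from the detectors */
  char next_light(char curr, int carEW, int carNS) {
      if (curr == 'N')
          return carEW ? 'E' : 'N';  /* change only if a car waits on East-West */
      else
          return carNS ? 'N' : 'E';  /* change only if a car waits on North-South */
  }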

268 Lecture 14: FSM and Basic CPU Design
Chapter : 5 Lecture 14: FSM and Basic CPU Design Finite state machines Single-cycle CPU

277 Basic MIPS Architecture
Now that we understand clocks and storage of states, we’ll design a simple CPU that executes: basic math (add, sub, and, or, slt) memory access (lw and sw) branch and jump instructions (beq and j)

278 Implementation Overview
We need memory to store instructions to store data for now, let’s make them separate units We need registers, ALU, and a whole lot of control logic CPU operations common to all instructions: use the program counter (PC) to pull instruction out of instruction memory read register values

279 View from 30,000 Feet
Note: we haven’t bothered showing multiplexors What is the role of the Add units? Explain the inputs to the data memory unit Explain the inputs to the ALU Explain the inputs to the register unit

280 Clocking Methodology Which of the above units need a clock?
What is being saved (latched) on the rising edge of the clock? Keep in mind that the latched value remains there for an entire cycle

281 Implementing R-type Instructions
Instructions of the form add $t1, $t2, $t3 Explain the role of each signal

282 Implementing Loads/Stores
Instructions of the form lw $t1, 8($t2) and sw $t1, 8($t2) Where does this input come from?

283 Implementing J-type Instructions
Instructions of the form beq $t1, $t2, offset (note: beq is actually encoded in the I-format; j is the true J-type instruction)

284 View from 10,000 Feet

285 View from 5,000 Feet

286 Single Vs. Multi-Cycle Machine
In this implementation, every instruction requires one cycle to complete → cycle time = time taken for the slowest instruction If the execution was broken into multiple (faster) cycles, the shorter instructions can finish sooner
Single-cycle (cycle time = 20 ns): Load 1 cycle, Add 1 cycle, Beq 1 cycle → time for a load, add, and beq = 60 ns
Multi-cycle (cycle time = 5 ns): Load 4 cycles, Add 3 cycles, Beq 2 cycles → time for a load, add, and beq = 45 ns

287 Lecture 16: Basic CPU Design
Single-cycle CPU Multi-cycle CPU

298 Multi-Cycle Processor
Single memory unit shared by instructions and data Single ALU also used for PC updates Registers (latches) to store the result of every block

299 Cycle 1 The PC is used to select the appropriate instruction out
of the memory unit The instruction is latched into the instruction register at the end of the clock cycle The ALU performs PC+4 and stores it in the PC at the end of the clock cycle (note that ALU is free this cycle) The control circuits must now be “cycle-aware” – the new PC need not look up the instr-memory until we’re done executing the current instruction

300 Cycle 2 The instruction specifies the required register values –
these are read from the register file and stored in latches A and B (this happens even if the operands are not required) The last 16 bits are also used to compute PC+4+offset (in case this instruction turns out to be a branch) – this is latched into ALUOut Note that we haven’t yet figured out the instruction type, so the above operations are “speculative”

301 Cycle 3 The operations depend on the instruction type
Memory access: the address is computed by adding the offset to the value read from the register file, result is latched into ALUOut ALU: ALU operations are performed on the values read from the register file and the result is latched into ALUOut Branch: the ALU performs the operations for “beq” and if the branch happens, the branch target (currently in ALUOut) is latched into the PC at the end of the cycle Note that the branch operation has completed by the end of cycle 3; the other two instruction classes are still in flight

302 Cycle 4 Memory access: the address in ALUOut is used to pick
out a word from memory – this is latched into the memory data register ALU: the result latched into ALUOut is fed as input to the register file, the instruction stored in the instruction-latch specifies where the result is written into At the end of this cycle, the ALU operation and memory writes are complete

303 Cycle 5 Memory read: the value read from memory (and latched
in the memory data register) is now written into the register file Summary: Branches and jumps: 3 cycles ALU, stores: 4 cycles Memory access: 5 cycles ALU is slower since it requires a register file write Store is slower since it requires a data memory write Load is slower since it requires a data memory read and a register file write

304 Average CPI Now we can compute average CPI for a program: if the
given program is composed of loads (25%), stores (10%), branches (13%), and ALU ops (52%), the average CPI is 0.25 x 5 + 0.10 x 4 + 0.13 x 3 + 0.52 x 4 = 4.12 You can break this CPU design into shorter cycles, for example, a load would then take 10 cycles, stores 8, ALU 8, branch 6 → average CPI would double, but so would the clock speed, so the net performance would remain roughly the same Later, we’ll see that this strategy does help in most other cases.

305 Control Logic Note that the control signals for every unit are determined by two factors: the instruction type the cycle number for this instruction The control is therefore implemented as a finite state machine – every cycle, the FSM transitions to a new state with a certain set of outputs (the control signals) and this is a function of the inputs (the instr type)

306 Lecture 17: Basic Pipelining
Chapter : 6 Lecture 17: Basic Pipelining 5-stage pipeline Hazards and instruction scheduling

308 The Assembly Line
Unpipelined: start and finish a job before moving to the next Pipelined: break the job into smaller stages so that successive jobs A, B, C overlap in time [figure: jobs A, B, C plotted against time, unpipelined vs. pipelined]

309 Performance Improvements?
Does it take longer to finish each individual job? Does it take shorter to finish a series of jobs? What assumptions were made while answering these questions? Is a 10-stage pipeline better than a 5-stage pipeline?

310 Quantitative Effects As a result of pipelining:
Time in ns per instruction goes up Each instruction takes more cycles to execute But… average CPI remains roughly the same Clock speed goes up Total execution time goes down, resulting in lower average time per instruction Under ideal conditions, speedup = ratio of elapsed times between successive instruction completions = number of pipeline stages = increase in clock speed

311 A 5-Stage Pipeline

312 A 5-Stage Pipeline Use the PC to access the I-cache and increment PC by 4

313 A 5-Stage Pipeline Read registers, compare registers, compute branch target; for now, assume branches take 2 cyc (there is enough work that branches can easily take more)

314 A 5-Stage Pipeline ALU computation, effective address computation for load/store

315 A 5-Stage Pipeline Memory access to/from data cache, stores finish in 4 cycles

316 A 5-Stage Pipeline Write result of ALU computation or load into register file

317 Conflicts/Problems I-cache and D-cache are accessed in the same cycle – it helps to implement them separately Registers are read and written in the same cycle – easy to deal with if register read/write time equals cycle time/2 (else, use bypassing) Branch target changes only at the end of the second stage -- what do you do in the meantime? Data between stages get latched into registers (overhead that increases latency per instruction)

318 Hazards Structural hazards: different instructions in different stages
(or the same stage) conflicting for the same resource Data hazards: an instruction cannot continue because it needs a value that has not yet been generated by an earlier instruction Control hazard: fetch cannot continue because it does not know the outcome of an earlier branch – special case of a data hazard – separate category because they are treated in different ways

319 Structural Hazards Example: a unified instruction and data cache →
stage 4 (MEM) and stage 1 (IF) can never coincide The later instruction and all its successors are delayed until a cycle is found when the resource is free → these are pipeline bubbles Structural hazards are easy to eliminate – increase the number of resources (for example, implement a separate instruction and data cache)

320 Data Hazards

321 Bypassing Some data hazard stalls can be eliminated: bypassing

322 Data Hazard Stalls

323 Data Hazard Stalls

324 Example add $1, $2, $3 lw $4, 8($1)

325 Example lw $1, 8($2) lw $4, 8($1)

326 Example lw $1, 8($2) sw $1, 8($3)

327 Control Hazards Simple techniques to handle control hazard stalls:
for every branch, introduce a stall cycle (note: every 6th instruction is a branch!) assume the branch is not taken and start fetching the next instruction – if the branch is taken, need hardware to cancel the effect of the wrong-path instruction fetch the next instruction (branch delay slot) and execute it anyway – if the instruction turns out to be on the correct path, useful work was done – if the instruction turns out to be on the wrong path, hopefully program state is not lost

328 Branch Delay Slots

329 Slowdowns from Stalls Perfect pipelining with no hazards → an instruction completes every cycle (total cycles ~ num instructions) → speedup = increase in clock speed = num pipeline stages With hazards and stalls, some cycles (= stall time) go by during which no instruction completes, and then the stalled instruction completes Total cycles = number of instructions + stall cycles

330 Lecture 18: Pipelining Hazards and instruction scheduling
Branch prediction Out-of-order execution

339 Pipeline without Branch Predictor
[figure: PC → IF (br) → Reg Read → Compare → Br-target; until the branch resolves, the only next-fetch choice is PC + 4]

340 Pipeline with Branch Predictor
[figure: the same pipeline, with a Branch Predictor alongside the PC supplying the predicted next fetch address]

341 Bimodal Predictor [figure: 14 bits of the branch PC index a table of 16K entries, each a 2-bit saturating counter]

342 2-Bit Prediction For each branch, maintain a 2-bit saturating counter:
if the branch is taken: counter = min(3,counter+1) if the branch is not taken: counter = max(0,counter-1) … sound familiar? If (counter >= 2), predict taken, else predict not taken The counter attempts to capture the common case for each branch
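An illustrative C sketch of a bimodal predictor built from such counters (the table size and PC hashing mirror the figure, but are assumptions):

  #include <stdint.h>

  static uint8_t counters[16384];           /* 16K 2-bit saturating counters */

  int predict_taken(uint32_t branch_pc) {
      return counters[(branch_pc >> 2) & 16383] >= 2;
  }

  void train(uint32_t branch_pc, int taken) {
      uint8_t *c = &counters[(branch_pc >> 2) & 16383];
      if (taken) { if (*c < 3) (*c)++; }    /* counter = min(3, counter+1) */
      else       { if (*c > 0) (*c)--; }    /* counter = max(0, counter-1) */
  }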

344 Multicycle Instructions
Multiple parallel pipelines – each pipeline can have a different number of stages Instructions can now complete out of order – must make sure that writes to a register happen in the correct order

345 An Out-of-Order Processor Implementation
[figure: instr fetch queue → decode & rename → issue queue (IQ) → ALUs; a reorder buffer (ROB, tags T1-T6) sits alongside the register file R1-R32; results are written to the ROB and tags broadcast to the IQ]
Original code: R1 ← R1+R2 ; R2 ← R1+R3 ; BEQZ R2 ; R3 ← R1+R2 ; R1 ← R3+R2
After renaming: T1 ← R1+R2 ; T2 ← T1+R3 ; BEQZ T2 ; T4 ← T1+T2 ; T5 ← T4+T2

346 Chapter : 7 Lecture 19: Cache Basics Out-of-order execution
Cache hierarchies

349 Cache Hierarchies Data and instructions are stored on DRAM chips – DRAM is a technology that has high bit density, but relatively poor latency – an access to data in memory can take as many as 300 cycles today! Hence, some data is stored on the processor in a structure called the cache – caches employ SRAM technology, which is faster, but has lower bit density Internet browsers also cache web pages – same concept

350 Memory Hierarchy As you go further, capacity and latency increase
Registers: 1KB, 1 cycle
L1 data or instruction cache: 32KB, 2 cycles
L2 cache: 2MB, 15 cycles
Memory: 1GB, 300 cycles
Disk: 80 GB, 10M cycles

351 Locality Why do caches work?
Temporal locality: if you used some data recently, you will likely use it again Spatial locality: if you used some data recently, you will likely access its neighbors No hierarchy: average access time for data = 300 cycles 32KB 1-cycle L1 cache that has a hit rate of 95%: average access time = 0.95 x 1 + 0.05 x (301) = 16 cycles

352 Accessing the Cache [figure: byte address 101000 – the low bits are the offset within an 8-byte word, the next 3 bits index one of 8 sets in the data array]
Direct-mapped cache: each address maps to a unique location in the cache

353 The Tag Array [figure: byte address 101000 – the high-order bits are compared against the entry in the tag array to detect a hit]
Direct-mapped cache: each address maps to a unique location in the cache

354 Example Access Pattern
Assume that addresses are 8 bits long How many of the following address requests are hits/misses? 4, 7, 10, 13, 16, 68, 73, 78, 83, 88, 4, 7, 10… [figure: the same direct-mapped cache with 8-byte words, tag array, and data array]
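A worked trace (not on the slide; assuming the 8-set direct-mapped cache with 8-byte blocks from the previous slides, initially empty, with block number = address / 8 and set = block mod 8): 4 miss, 7 hit, 10 miss, 13 hit, 16 miss, 68 miss (block 8 evicts block 0 from set 0), 73 miss (evicts block 1), 78 hit, 83 miss (evicts block 2), 88 miss, 4 miss (its block was evicted), 7 hit, 10 miss – 4 hits and 9 misses in all.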

355 Increasing Line Size Byte address
A large cache line size → smaller tag array, fewer misses because of spatial locality [figure: address split into tag and offset for a 32-byte cache line (block) size]

356 Associativity Byte address
Set associativity → fewer conflicts; wasted power because multiple data and tags are read [figure: two-way set-associative lookup – the index selects a set and both ways' tags are compared in parallel]

357 Associativity
How many offset/index/tag bits if the cache has 64 sets, each set has 64 bytes, 4 ways? [figure: the same set-associative lookup diagram]
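A hedged worked answer (assuming 32-bit addresses): if the 64 bytes per set are split across the 4 ways, each block is 16 bytes → offset = 4 bits, index = log2(64) = 6 bits, tag = 32 - 6 - 4 = 22 bits; if instead each block is 64 bytes, offset = 6 bits and tag = 32 - 12 = 20 bits.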

358 Example 32 KB 4-way set-associative data cache array with 32
byte line sizes How many sets? How many index bits, offset bits, tag bits? How large is the tag array?
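Worked answer (assuming 32-bit addresses): sets = 32KB / (32 bytes x 4 ways) = 256 → index = 8 bits, offset = log2(32) = 5 bits, tag = 32 - 8 - 5 = 19 bits; the tag array holds 256 x 4 = 1024 tags of 19 bits each, about 2.4KB (plus valid bits).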

359 Cache Misses On a write miss, you may either choose to bring the block
into the cache (write-allocate) or not (write-no-allocate) On a read miss, you always bring the block in (spatial and temporal locality) – but which block do you replace? no choice for a direct-mapped cache randomly pick one of the ways to replace replace the way that was least-recently used (LRU) FIFO replacement (round-robin)

360 Writes When you write into a block, do you also update the copy in L2?
write-through: every write to L1 → write to L2 write-back: mark the block as dirty, when the block gets replaced from L1, write it to L2 Writeback coalesces multiple writes to an L1 block into one L2 write Writethrough simplifies coherency protocols in a multiprocessor system as the L2 always has a current copy of data

361 Types of Cache Misses Compulsory misses: happen the first time a memory word is accessed – the misses for an infinite cache Capacity misses: happen because the program touched many other words before re-touching the same word – the misses for a fully-associative cache Conflict misses: happen because two words map to the same location in the cache – the misses generated while moving from a fully-associative to a direct-mapped cache

362 Lecture 20: Cache Hierarchies, Virtual Memory

371 Virtual Memory Processes deal with virtual memory – they have the
illusion that a very large address space is available to them There is only a limited amount of physical memory that is shared by all processes – a process places part of its virtual memory in this physical memory and the rest is stored on disk (called swap space) Thanks to locality, disk access is likely to be uncommon The hardware ensures that one process cannot access the memory of a different process

372 Address Translation
The virtual and physical memory are broken up into pages 8KB page size → 13-bit page offset [figure: virtual address = virtual page number + 13-bit page offset; the virtual page number is translated to a physical page number to form the physical address]
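An illustrative C sketch of the translation arithmetic (the flat page table is an assumption for clarity – real page tables are hierarchical, as a later slide notes):

  #include <stdint.h>

  #define PAGE_BITS 13  /* 8KB pages */

  /* page_table[vpn] holds the physical page number for that virtual page */
  uint64_t translate(uint64_t vaddr, const uint64_t *page_table) {
      uint64_t vpn    = vaddr >> PAGE_BITS;
      uint64_t offset = vaddr & ((1u << PAGE_BITS) - 1);
      return (page_table[vpn] << PAGE_BITS) | offset;
  }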

373 Memory Hierarchy Properties
A virtual memory page can be placed anywhere in physical memory (fully-associative) Replacement is usually LRU (since the miss penalty is huge, we can invest some effort to minimize misses) A page table (indexed by virtual page number) is used for translating virtual to physical page number The page table is itself in memory

374 TLB Since the number of pages is very high, the page table
capacity is too large to fit on chip A translation lookaside buffer (TLB) caches the virtual to physical page number translation for recent accesses A TLB miss requires us to access the page table, which may not even be found in the cache – two expensive memory look-ups to access one word of data! A large page size can increase the coverage of the TLB and reduce the capacity of the page table, but also increases memory wastage

375 TLB and Cache Is the cache indexed with virtual or physical address?
To index with a physical address, we will have to first look up the TLB, then the cache → longer access time Multiple virtual addresses can map to the same physical address – must ensure that these different virtual addresses will map to the same location in cache – else, there will be two different copies of the same physical memory word Does the tag array store virtual or physical addresses? Since multiple virtual addresses can map to the same physical address, a virtual tag comparison can flag a miss even if the correct physical memory word is present

376 Cache and TLB Pipeline Virtually Indexed; Physically Tagged Cache
[figure: the virtual page number goes to the TLB while the virtual index reads the tag and data arrays in parallel; the physical page number from the TLB supplies the physical tag for the comparison]

377 Bad Events Consider the longest latency possible for a load instruction: TLB miss: must look up page table to find translation for v.page P Calculate the virtual memory address for the page table entry that has the translation for page P – let’s say, this is v.page Q TLB miss for v.page Q: will require navigation of a hierarchical page table (let’s ignore this case for now and assume we have succeeded in finding the physical memory location (R) for page Q) Access memory location R (find this either in L1, L2, or memory) We now have the translation for v.page P – put this into the TLB We now have a TLB hit and know the physical page number – this allows us to do tag comparison and check the L1 cache for a hit If there’s a miss in L1, check L2 – if that misses, check in memory At any point, if the page table entry claims that the page is on disk, flag a page fault – the OS then copies the page from disk to memory and the hardware resumes what it was doing before the page fault … phew!

378 Lecture 21: Virtual Memory, I/O Basics
I/O overview

386 Input/Output [figure: CPU and cache connect over a bus to memory, disk, network, USB, and DVD]

387 I/O Hierarchy [figure: CPU and cache on a memory bus to memory; an I/O controller bridges to a slower I/O bus hosting the disk, network, USB, and DVD devices]

388 Intel Example [figure: P4 processor on an 800 MHz, 6.4 GB/sec system bus to the Memory Controller Hub (North Bridge), which drives graphics output (2.1 GB/sec) and DDR 400 main memory (3.2 GB/sec); two 266 MB/sec links lead to the I/O Controller Hub (South Bridge), which hosts 1 Gb Ethernet, Serial ATA (150 MB/s), CD/DVD and disk (100 MB/s), tape (100 MB/s), and USB 2.0 (60 MB/s)]

389 Bus Design The bus is a shared resource – any device can send
data on the bus (after first arbitrating for it) and all other devices can read this data off the bus The address/control signals on the bus specify the intended receiver of the message The length of the bus determines its speed (hence, a hierarchy makes sense) Buses can be synchronous (a clock determines when each operation must happen) or asynchronous (a handshaking protocol is used to co-ordinate operations)

390 Memory-Mapped I/O Each I/O device has its own special address range
The CPU issues commands such as these: sw [some-data] [some-address] Usually, memory services these requests… if the address is in the I/O range, memory ignores it The data is written into some register in the appropriate I/O device – this serves as the command to the device

391 Polling Vs. Interrupt-Driven
When the I/O device is ready to respond, it can send an interrupt to the CPU; the CPU stops what it was doing; the OS examines the interrupt and then reads the data produced by the I/O device (and usually stores into memory) In the polling approach, the CPU (OS) periodically checks the status of the I/O device and if the device is ready with data, the OS reads it

392 Direct Memory Access (DMA)
Consider a disk read example: a block on disk is being read into memory
For each word, the CPU would do a lw [destination-register] [I/O-device-address] and an sw [data-in-above-register] [memory-address]
This would take up too much of the CPU's time – hence, the task is off-loaded to the DMA controller: the CPU informs the DMA controller of the range of addresses to be copied, and the DMA controller lets the CPU know when it is done
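A sketch of how a driver might program such a controller; the four registers and the GO bit are invented for illustration:

    #include <stdint.h>

    #define DMA_SRC  ((volatile uint32_t *)0xFFFF1000)  /* assumed layout */
    #define DMA_DST  ((volatile uint32_t *)0xFFFF1004)
    #define DMA_LEN  ((volatile uint32_t *)0xFFFF1008)
    #define DMA_CTRL ((volatile uint32_t *)0xFFFF100C)
    #define DMA_GO   0x1u

    void dma_start(uint32_t src, uint32_t dst, uint32_t nbytes) {
        *DMA_SRC  = src;          /* range of addresses to be copied */
        *DMA_DST  = dst;
        *DMA_LEN  = nbytes;
        *DMA_CTRL = DMA_GO;       /* CPU resumes other work; the controller
                                     raises an interrupt when the copy ends */
    }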

393 Lecture 22: I/O, Disk Systems
Chapter 8: I/O overview, disk basics, RAID

401 Role of I/O
Activities external to the CPU are typically orders of magnitude slower
Example: while CPU performance has improved by 50% per year, disk latencies have improved by only 10% per year
Typical strategy on I/O: switch contexts and work on something else
Other metrics, such as bandwidth, reliability, availability, and capacity, often receive more attention than performance

402 Magnetic Disks
A magnetic disk consists of 1-12 platters (metal or glass disks covered with magnetic recording material on both sides), with diameters between 1 and 3.5 inches
Each platter is comprised of concentric tracks (5-30K) and each track is divided into sectors (100-500 per track, each about 512 bytes)
A movable arm holds the read/write heads for each disk surface and moves them all in tandem – a cylinder of data is accessible at a time

403 Disk Latency
To read/write data, the arm has to be placed on the correct track – this seek time usually takes 5 to 12 ms on average – it can take less if there is spatial locality
Rotational latency is the time taken to rotate the correct sector under the head – the average is typically more than 2 ms (15,000 RPM)
Transfer time is the time taken to transfer a block of bits out of the disk and is typically 3-65 MB/second
A disk controller maintains a disk cache (spatial locality can be exploited) and sets up the transfer on the bus (controller overhead)

404 Defining Reliability and Availability
A system toggles between
 Service accomplishment: service matches specifications
 Service interruption: service deviates from specs
The toggle is caused by failures and restorations
Reliability measures continuous service accomplishment and is usually expressed as mean time to failure (MTTF)
Availability measures the fraction of time that service matches specifications, expressed as MTTF / (MTTF + MTTR)
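For example, assuming an MTTF of 1,000,000 hours and an MTTR of 24 hours (illustrative numbers only): availability = 1,000,000 / (1,000,000 + 24) ≈ 0.999976, i.e., roughly 99.998% uptime.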

405 RAID
Reliability and availability are important metrics for disks
RAID: redundant array of inexpensive (independent) disks
Redundancy can deal with one or more failures
Each sector of a disk records check information that allows it to determine if the disk has an error or not (in other words, redundancy already exists within a disk)
When a disk read flags an error, we turn elsewhere for the correct data

406 RAID 0 and RAID 1
RAID 0 has no additional redundancy (a misnomer) – it uses an array of disks and stripes (interleaves) data across the array to improve parallelism and throughput
RAID 1 mirrors or shadows every disk – every write happens to two disks
Reads from the mirror may happen only when the primary disk fails – or, you may try to read both together, and the quicker response is accepted
Expensive solution: high reliability at twice the cost

407 RAID 3
Data is bit-interleaved across several disks and a separate disk maintains parity information for a set of bits
For example, with 8 disks: bit 0 is on disk-0, bit 1 is on disk-1, …, bit 7 is on disk-7; disk-8 maintains parity for all 8 bits
For any read, 8 disks must be accessed (as we usually read more than a byte at a time), and for any write, all 9 disks must be accessed as the parity has to be re-calculated
High throughput for a single request, low cost for redundancy (overhead: 12.5%), low task-level parallelism

408 RAID 4 and RAID 5
Data is block-interleaved – this allows us to get all our data from a single disk on a read – in case of a disk error, read all 9 disks
Block interleaving reduces throughput for a single request (as only a single disk drive services the request), but improves task-level parallelism as the other disk drives are free to service other requests
On a write, we access the disk that stores the data and the parity disk – parity information can be updated simply by checking whether the new data differs from the old data (see the sketch below)
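A minimal sketch of the parity arithmetic in C, assuming 512-byte blocks; XOR is all that is needed, both for full recomputation and for the RAID 4/5 small-write shortcut:

    #include <stdint.h>
    #include <stddef.h>

    #define BLOCK 512   /* assumed block size */

    /* Recompute parity over n data blocks (what bit-striped RAID 3 does). */
    void parity_full(uint8_t parity[BLOCK], uint8_t data[][BLOCK], size_t n) {
        for (size_t i = 0; i < BLOCK; i++) {
            uint8_t p = 0;
            for (size_t d = 0; d < n; d++)
                p ^= data[d][i];
            parity[i] = p;
        }
    }

    /* RAID 4/5 small write: only the old data, new data, and old parity
       are needed -- the disks holding the other blocks stay untouched. */
    void parity_update(uint8_t parity[BLOCK],
                       const uint8_t old_data[BLOCK],
                       const uint8_t new_data[BLOCK]) {
        for (size_t i = 0; i < BLOCK; i++)
            parity[i] ^= old_data[i] ^ new_data[i];
    }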

409 RAID 5
If we have a single disk for parity, multiple writes cannot happen in parallel (as all writes must update the parity info)
RAID 5 distributes the parity blocks across the disks to allow simultaneous writes

410 RAID Summary
RAID 1-5 can tolerate a single fault – mirroring (RAID 1) has a 100% overhead, while parity (RAID 3, 4, 5) has modest overhead
Multiple faults can be tolerated by having multiple check functions – each additional check can cost an additional disk (RAID 6)
RAID 6 and RAID 2 (memory-style ECC) are not commercially employed

411 I/O Performance
Throughput (bandwidth) and response time (latency) are the key performance metrics for I/O
The description of the hardware characterizes the maximum throughput and the average response time (usually with no queueing delays)
The description of the workload characterizes the “real” throughput – corresponding to this throughput is an average response time

412 Throughput Vs. Response Time
As load increases, throughput increases (as utilization is high); simultaneously, response times also go up, as the probability of having to wait for service goes up: a trade-off between throughput and response time
In systems involving human interaction, there are three relevant delays: data entry time, system response time, and think time – studies have shown that improvements in response time result in improvements in think time, hence better response time and much better throughput
Most benchmark suites try to determine throughput while placing a restriction on response times

413 Lecture 23: Multiprocessors
Chapter 9: RAID, multiprocessor taxonomy, snooping-based cache coherence protocol

419 Multiprocessor Taxonomy
SISD: single instruction, single data stream – the uniprocessor
MISD: no commercial multiprocessor – imagine data going through a pipeline of execution engines
SIMD: vector architectures – lower flexibility
MIMD: most multiprocessors today – easy to construct with off-the-shelf computers, most flexible

420 Memory Organization - I
Centralized shared-memory multiprocessor or symmetric shared-memory multiprocessor (SMP)
Multiple processors connected to a single centralized memory – since all processors see the same memory organization, access is uniform (uniform memory access, UMA)
Shared-memory because all processors can access the entire memory address space
Can the centralized memory emerge as a bandwidth bottleneck? – not if you have large caches and employ fewer than a dozen processors

421 SMPs or Centralized Shared-Memory
[Figure: four processors, each with its own caches, share a bus to a single main memory and I/O system.]

422 Memory Organization - II
For higher scalability, memory is distributed among processors – distributed memory multiprocessors
If one processor can directly address the memory local to another processor, the address space is shared – a distributed shared-memory (DSM) multiprocessor
If memories are strictly local, we need messages to communicate data – a cluster of computers or multicomputer
Non-uniform memory access (NUMA), since local memory has lower latency than remote memory

423 Distributed Memory Multiprocessors
[Figure: four nodes, each a processor with caches plus local memory and I/O, connected by an interconnection network.]

424 SMPs
Centralized main memory and many caches lead to many copies of the same data
A system is cache coherent if a read returns the most recently written value for that word

    Time  Event                 X in Cache-A   X in Cache-B   X in Memory
     1    CPU-A reads X
     2    CPU-B reads X
     3    CPU-A stores 0 in X

425 Cache Coherence
A memory system is coherent if:
P writes to X; no other processor writes to X; P reads X and receives the value it previously wrote
P1 writes to X; no other processor writes to X; sufficient time elapses; P2 reads X and receives the value written by P1
Two writes to the same location by two processors are seen in the same order by all processors – write serialization
The memory consistency model defines how much “time must elapse” before the effect of one processor's write is seen by others

426 Cache Coherence Protocols
Directory-based: a single location (the directory) keeps track of the sharing status of a block of memory
Snooping: every cache block is accompanied by its sharing status – all cache controllers monitor the shared bus so they can update the sharing status of the block, if necessary
Write-invalidate: a processor gains exclusive access to a block before writing, by invalidating all other copies
Write-update: when a processor writes, it updates the other shared copies of that block

427 Design Issues
Three states for a block: invalid, shared, modified
A write is placed on the bus and sharers invalidate themselves
[Figure: four processors with caches on a shared bus to main memory and I/O – the bus is what each cache controller snoops.]

428 Lecture 24: Multiprocessors
Directory-based cache coherence protocol Synchronization Consistency Writing parallel programs

429 Snooping-Based Protocols
Three states for a block: invalid, shared, modified
A write is placed on the bus and sharers invalidate themselves
The protocols are referred to as MSI, MESI, etc.
[Figure: four processors with caches on a shared bus to main memory and I/O.]
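A minimal sketch of the MSI transitions for one block, written as a C state function; real controllers also manage data transfer, write-backs, and bus arbitration, all elided here:

    /* MSI states for a cache block. */
    typedef enum { INVALID, SHARED, MODIFIED } msi_state;

    /* Next state when a bus transaction for this block is observed.
       is_write: the transaction is a write/upgrade; from_self: our own
       request completing (vs. another cache's request being snooped). */
    msi_state msi_next(msi_state cur, int is_write, int from_self) {
        if (from_self)
            return is_write ? MODIFIED : SHARED;
        if (is_write)
            return INVALID;               /* remote write invalidates us   */
        if (cur == MODIFIED)
            return SHARED;                /* downgrade and supply the data */
        return cur;                       /* remote reads leave S/I alone  */
    }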

430 Example
P1 reads X: not found in cache-1, request sent on the bus, memory responds, X is placed in cache-1 in shared state
P2 reads X: not found in cache-2, request sent on the bus, everyone snoops this request, cache-1 does nothing because this is just a read request, memory responds, X is placed in cache-2 in shared state
P1 writes X: cache-1 has the data in shared state (shared only provides read permission), request sent on the bus, cache-2 snoops it and invalidates its copy of X, cache-1 moves its state to modified
P2 reads X: cache-2 has the data in invalid state, request sent on the bus, cache-1 snoops it and realizes it has the only valid copy, so it downgrades itself to shared state and responds with the data, X is placed in cache-2 in shared state
[Figure: P1 with cache-1 and P2 with cache-2 on a bus to main memory.]

432 Coherence in Distributed Memory Multiprocessors
Distributed memory systems are typically larger, so bus-based snooping may not work well
Option 1: software-based mechanisms – message-passing systems or software-controlled cache coherence
Option 2: hardware-based mechanisms – directory-based cache coherence

433 Distributed Memory Multiprocessors
[Figure: four nodes, each a processor with caches, local memory, I/O, and a directory, connected by an interconnection network.]

434 Directory-Based Cache Coherence
The physical memory is distributed among all processors
The directory is distributed along with the corresponding memory
The physical address is enough to determine the location of memory
The (many) processing nodes are connected with a scalable interconnect (not a bus) – hence, messages are no longer broadcast but routed from sender to receiver – since the processing nodes can no longer snoop, the directory keeps track of the sharing state

435 Cache Block States
What are the different states a block of memory can have within the directory?
Note that we need information for each cache so that invalidate messages can be sent
The directory now serves as the arbitrator: if multiple write attempts happen simultaneously, the directory determines the ordering

436 Directory-Based Example
Access sequence: A: Rd X; B: Rd X; C: Rd X; A: Wr X; C: Wr X; A: Rd Y; B: Wr X; B: Rd Y; B: Wr Y
[Figure: three nodes (each a processor with caches, memory, I/O, and a directory) on an interconnection network; X and Y are homed in different nodes' directories.]

437 Directory Actions
If the block is in uncached state:
 Read miss: send data, make block shared
 Write miss: send data, make block exclusive
If the block is in shared state:
 Read miss: send data, add node to sharers list
 Write miss: send data, invalidate sharers, make block exclusive
If the block is in exclusive state:
 Read miss: ask owner for data, write to memory, send data, make block shared, add node to sharers list
 Data write-back: write to memory, make block uncached
 Write miss: ask owner for data, write to memory, send data, update identity of the new owner, remain exclusive

438 Constructing Locks
Applications have phases (consisting of many instructions) that must be executed atomically, without other parallel processes modifying the data
A lock surrounding the data/code ensures that only one program can be in a critical section at a time
The hardware must provide some basic primitives that allow us to construct locks with different properties
Example – parallel (unlocked) banking transactions on a $1000 balance: one thread reads $1000, adds $100, and writes $1100, while another reads $1000, adds $200, and writes $1200 – one of the two deposits is lost

439 Synchronization
The simplest hardware primitive that greatly facilitates synchronization implementations (locks, barriers, etc.) is an atomic read-modify-write
Atomic exchange: swap the contents of a register and a memory location
Special case of atomic exchange – test & set: transfer a memory location into a register and write 1 into the memory location (if memory has 0, the lock is free)

    lock:  t&s  register, location
           bnz  register, lock
           CS
           st   location, #0

When multiple parallel threads execute this code, only one will be able to enter CS
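The same spin lock, written as a minimal sketch with C11 atomics; atomic_exchange plays the role of t&s, and 0 means free, matching the assembly above:

    #include <stdatomic.h>

    static atomic_int lock_word = 0;   /* 0 = free, 1 = held */

    void acquire(void) {
        while (atomic_exchange(&lock_word, 1) != 0)
            ;                          /* spin until the exchange reads 0 */
    }

    void release(void) {
        atomic_store(&lock_word, 0);   /* the "st location, #0" above */
    }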

440 Coherence Vs. Consistency
Recall that coherence guarantees (i) write propagation (a write will eventually be seen by other processors), and (ii) write serialization (all processors see writes to the same location in the same order)
The consistency model defines the ordering of writes and reads to different memory locations – the hardware guarantees a certain consistency model and the programmer attempts to write correct programs under those assumptions

441 Consistency Example
Consider a multiprocessor with bus-based snooping cache coherence and a write buffer between CPU and cache

    Initially A = B = 0

    P1               P2
    A ← 1            B ← 1
    …                …
    if (B == 0)      if (A == 0)
      Crit.Section     Crit.Section

The programmer expected the above code to implement a lock – because of write buffering, both processors can enter the critical section
The consistency model lets the programmer know what assumptions they can make about the hardware's reordering capabilities
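One conventional repair, sketched with C11 atomics, is a full fence between each processor's write and its subsequent read, so the buffered write is made visible first (P2's code is the mirror image):

    #include <stdatomic.h>

    atomic_int A = 0, B = 0;

    void p1(void) {
        atomic_store_explicit(&A, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);  /* write reaches memory
                                                       before the read below */
        if (atomic_load_explicit(&B, memory_order_relaxed) == 0) {
            /* critical section -- now at most one of P1/P2 gets here */
        }
    }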

442 Sequential Consistency
A multiprocessor is sequentially consistent if the result of the execution is achievable by maintaining program order within a processor and interleaving the accesses of different processors in an arbitrary fashion
The multiprocessor in the previous example is not sequentially consistent
Sequential consistency can be implemented by requiring the following: program order, write serialization, and everyone having seen an update before the value is read – very intuitive for the programmer, but extremely slow

443 Shared-Memory Vs. Message-Passing
Shared-memory:
 Well-understood programming model
 Communication is implicit and the hardware handles protection
 Hardware-controlled caching
Message-passing:
 No cache coherence – simpler hardware
 Explicit communication – easier for the programmer to restructure code
 Software-controlled caching
 The sender can initiate data transfer

444 Ocean Kernel

    Procedure Solve(A)
    begin
      diff = done = 0;
      while (!done) do
        diff = 0;
        for i ← 1 to n do
          for j ← 1 to n do
            temp = A[i,j];
            A[i,j] ← 0.2 * (A[i,j] + neighbors);
            diff += abs(A[i,j] - temp);
          end for
        end for
        if (diff < TOL) then done = 1;
      end while
    end procedure

445 Shared Address Space Model

    int n, nprocs;
    float **A, diff;
    LOCKDEC(diff_lock);
    BARDEC(bar1);

    main()
    begin
      read(n); read(nprocs);
      A ← G_MALLOC();
      initialize(A);
      CREATE(nprocs, Solve, A);
      WAIT_FOR_END(nprocs);
    end main

    procedure Solve(A)
      int i, j, pid, done = 0;
      float temp, mydiff = 0;
      int mymin = 1 + (pid * n/nprocs);
      int mymax = mymin + n/nprocs - 1;
      while (!done) do
        mydiff = diff = 0;
        BARRIER(bar1, nprocs);
        for i ← mymin to mymax
          for j ← 1 to n do
            …
          endfor
        endfor
        LOCK(diff_lock);
        diff += mydiff;
        UNLOCK(diff_lock);
        BARRIER(bar1, nprocs);
        if (diff < TOL) then done = 1;
      endwhile

446 Message Passing Model

    main()
      read(n); read(nprocs);
      CREATE(nprocs-1, Solve);
      Solve();
      WAIT_FOR_END(nprocs-1);

    procedure Solve()
      int i, j, pid, nn = n/nprocs, done = 0;
      float temp, tempdiff, mydiff = 0;
      myA ← malloc(…);
      initialize(myA);
      while (!done) do
        mydiff = 0;
        if (pid != 0) SEND(&myA[1,0], n, pid-1, ROW);
        if (pid != nprocs-1) SEND(&myA[nn,0], n, pid+1, ROW);
        if (pid != 0) RECEIVE(&myA[0,0], n, pid-1, ROW);
        if (pid != nprocs-1) RECEIVE(&myA[nn+1,0], n, pid+1, ROW);
        for i ← 1 to nn do
          for j ← 1 to n do
            …
          endfor
        endfor
        if (pid != 0)
          SEND(mydiff, 1, 0, DIFF);
          RECEIVE(done, 1, 0, DONE);
        else
          for i ← 1 to nprocs-1 do
            RECEIVE(tempdiff, 1, *, DIFF);
            mydiff += tempdiff;
          endfor
          if (mydiff < TOL) done = 1;
          for i ← 1 to nprocs-1 do
            SEND(done, 1, i, DONE);
          endfor
        endif
      endwhile

447 Lecture 25: Multi-core Processors
Writing parallel programs SMT Multi-core examples

452 Multithreading Within a Processor
Until now, we have executed multiple threads of an application on different processors – can multiple threads execute concurrently on the same processor?
Why is this desirable?
 inexpensive – one CPU, no external interconnects
 no remote or coherence misses (but more capacity misses)
Why does this make sense?
 most processors can't find enough work – peak IPC is 6, average IPC is 1.5!
 threads can share resources – we can increase the number of threads without a corresponding linear increase in area

453 How are Resources Shared?
[Figure: issue slots over time for a superscalar processor, fine-grained multithreading, and simultaneous multithreading; each box is an issue slot for a functional unit, four threads plus idle slots are shown, and peak throughput is 4 IPC.]
The superscalar processor has high under-utilization – not enough work every cycle, especially when there is a cache miss
Fine-grained multithreading can only issue instructions from a single thread in a cycle – it cannot find the maximum work every cycle, but cache misses can be tolerated
Simultaneous multithreading can issue instructions from any thread every cycle – it has the highest probability of finding work for every issue slot

454 Performance Implications of SMT
Single-thread performance is likely to go down (caches, branch predictors, registers, etc. are shared) – this effect can be mitigated by trying to prioritize one thread
With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4x

455 Pentium4: Hyper-Threading
Two threads – the Linux operating system operates as if it is executing on a two-processor system When there is only one available thread, it behaves like a regular single-threaded superscalar processor

456 Multi-Programmed Speedup

457 Why Multi-Cores?
New constraints: power, temperature, complexity
Because of the above, we can't introduce complex techniques to improve single-thread performance
Most of the low-hanging fruit for single-thread performance has been picked
Hence, additional transistors have the biggest impact on throughput if they are used to execute multiple threads … this assumes that most users will run multi-threaded applications

458 Efficient Use of Transistors
Transistors can be used for:
 cache hierarchies
 number of cores
 multi-threading within a core (SMT)
Should we simplify cores so we have more transistors available?
[Figure: alternative floorplans of cores and cache banks.]

459 Design Space Exploration
[Figure: design-space results, where p = scalar pipelines, t = threads, s = superscalar pipelines; from Davis et al., PACT 2005.]

460 Case Study I: Sun's Niagara
Commercial servers require high thread-level throughput and suffer from cache misses
Sun's Niagara focuses on:
 simple cores (low power, low design complexity, can accommodate more cores)
 fine-grained multi-threading (to tolerate long memory latencies)

461 Niagara Overview

462 SPARC Pipe
No branch predictor
Low clock speed (1.2 GHz)
One FP unit shared by all cores

463 Case Study II: Intel Core Architecture
Single-thread execution is still considered important – out-of-order execution and speculation are very much alive; initial processors will have few heavy-weight cores
To reduce power consumption, the Core architecture (14 pipeline stages) is closer to the Pentium M (12 stages) than to the P4 (30 stages)
Many transistors are invested in a large branch predictor to reduce wasted work (power)
Similarly, SMT is not guaranteed for all incarnations of the Core architecture (SMT makes a hotspot hotter)

464 Cache Organizations for Multi-cores
L1 caches are always private to a core
L2 caches can be private or shared – which is better?
[Figure: two four-core organizations – each core with a private L1 and a private L2, versus private L1s in front of a single shared L2.]

465 Cache Organizations for Multi-cores
L1 caches are always private to a core
L2 caches can be private or shared
Advantages of a shared L2 cache:
 efficient dynamic allocation of space to each core
 data shared by multiple cores is not replicated
 every block has a fixed “home” – hence, it is easy to find the latest copy
Advantages of a private L2 cache:
 quick access to the private L2 – good for small working sets
 a private bus to the private L2 means less contention

