1
Chapter 1: Introduction
Logistics, why computer organization is important, modern trends
2
Why Computer Organization
Yes, I know, required class…
3
Why Computer Organization
Embarrassing if you have a BS in CS/CE and can't make sense of the following terms: DRAM, pipelining, cache hierarchies, I/O, virtual memory
Embarrassing if you have a BS in CS/CE and can't decide which processor to buy: 3 GHz P4 or 2.5 GHz Athlon (this class helps us reason about performance/power)
Obvious first step for chip designers, compiler/OS writers
Will knowledge of the hardware help me write better programs?
4
Must a Programmer Care About Hardware?
Memory management: if we understand how/where data is placed, we can help ensure that relevant data is nearby
Thread management: if we understand how threads interact, we can write smarter multi-threaded programs
Why do we care about multi-threaded programs?
5
Microprocessor Performance
50% improvement every year!! What contributes to this improvement?
6
Modern Trends Historical contributions to performance:
Better processes (faster devices): ~20%
Better circuits/pipelines: ~15%
Better organization/architecture: ~15%
In the future, the second bullet will help little and the third will not help much for a single core!
[Table: year, transistor count, and clock speed for Pentium, P-Pro, P-II, P-III, P-4, Itanium, and Montecito – Moore's Law in action]
At this point, adding transistors to a core yields little benefit
7
What Does This Mean to a Programmer?
In the past, a new chip directly meant 50% higher performance for a program
Today, one can expect only a 20% improvement, unless… the program can be broken up into multiple threads
Expect #threads to emerge as a major metric for software quality
[Figure: 4-way and 8-way multi-core chips]
8
Challenges for the Hardware Designers
Major concerns: The performance problem (especially scientific workloads) The power dissipation problem (especially embedded processors) The temperature problem The reliability problem
9
The HW/SW Interface
Application software:  a[i] = b[i] + c;
The compiler (systems software: OS, compiler) translates this into assembly:
  lw  $15, 0($2)
  add $16, $15, $14
  add $17, $15, $13
  lw  $18, 0($12)
  lw  $19, 0($17)
  add $20, $18, $19
  sw  $20, 0($16)
The assembler then turns the assembly into machine instructions that execute on the hardware …
10
Computer Components
Input/output devices
Secondary storage: non-volatile, slower, cheaper
Primary storage: volatile, faster, costlier
CPU/processor
11
Wafers and Dies
12
Manufacturing Process
Silicon wafers undergo many processing steps so that different parts of the wafer behave as insulators, conductors, and transistors (switches) Multiple metal layers on the silicon enable connections between transistors The wafer is chopped into many dies – the size of the die determines yield and cost
13
Processor Technology Trends
Shrinking of transistor sizes: 250nm (1997) → 130nm (2002) → 70nm (2008) → 35nm (2014)
Transistor density increases by 35% per year and die size increases by 10-20% per year… functionality improvements!
Transistor speed improves linearly with size (complex equation involving voltages, resistances, capacitances)
Wire delays do not scale down at the same rate as transistor delays
14
Memory and I/O Technology Trends
DRAM density increases by 40-60% per year; latency has reduced by 33% in 10 years (the memory wall!); bandwidth improves twice as fast as latency decreases
Disk density improves by 100% every year; latency improvement is similar to DRAM
Networks: the primary focus is on bandwidth; 10Mb → 100Mb in 10 years; 100Mb → 1Gb in 5 years
15
Power Consumption Trends
Dynamic power ∝ activity x capacitance x voltage^2 x frequency
Capacitance per transistor and voltage are decreasing, but the number of transistors and the frequency are increasing at a faster rate
Leakage power is also rising and will soon match dynamic power
Power consumption is already around 100W in some high-performance processors today
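To make the proportionality concrete, here is a small C sketch comparing a hypothetical baseline chip against a scaled design; all numeric factors are illustrative assumptions, not measurements of any real processor.

#include <stdio.h>

/* Dynamic power is proportional to: activity x capacitance x voltage^2 x frequency */
static double dyn_power(double activity, double capacitance,
                        double voltage, double frequency) {
    return activity * capacitance * voltage * voltage * frequency;
}

int main(void) {
    /* Baseline, in normalized (made-up) units. */
    double base = dyn_power(1.0, 1.0, 1.0, 1.0);
    /* Hypothetical next generation: capacitance per transistor and voltage drop,
       but transistor count (the activity term) and frequency rise faster. */
    double next = dyn_power(1.6, 0.7, 0.9, 1.5);
    printf("relative dynamic power: %.2f\n", next / base);  /* > 1: power still grows */
    return 0;
}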
16
Next Class Topics: MIPS instruction set architecture (Chapter 2)
Visit the class web-page
Sign up for the mailing list
Pick up CADE Lab passwords
17
Lecture 1: MIPS Instruction Set
Chapter 2: MIPS instructions
18
Recap Knowledge of hardware improves software quality:
compilers, OS, threaded programs, memory management Important trends: growing transistors, move to multi-core, slowing rate of performance improvement, power/thermal constraints, long memory/disk latencies
19
Instruction Set Understanding the language of the hardware is key to understanding the hardware/software interface A program (in say, C) is compiled into an executable that is composed of machine instructions – this executable must also run on future machines – for example, each Intel processor reads in the same x86 instructions, but each processor handles instructions differently Java programs are converted into portable bytecode that is converted into machine instructions during execution (just-in-time compilation) What are important design principles when defining the instruction set architecture (ISA)?
20
Instruction Set Important design principles when defining the
instruction set architecture (ISA):
 keep the hardware simple – the chip must only implement basic primitives and run fast
 keep the instructions regular – simplifies the decoding/scheduling of instructions
21
A Basic MIPS Instruction
C code: a = b + c ;
Assembly code (human-friendly machine instructions):  add a, b, c   # a is the sum of b and c
Machine code (hardware-friendly machine instructions): the 32-bit binary encoding of the add
Translate the following C code into assembly code: a = b + c + d + e;
22
Example C code a = b + c + d + e;
translates into the following assembly code:
  add a, b, c          add a, b, c
  add a, a, d    or    add f, d, e
  add a, a, e          add a, a, f
Instructions are simple: fixed number of operands (unlike C)
A single line of C code is converted into multiple lines of assembly code
Some sequences are better than others… the second sequence needs one more (temporary) variable f
23
Subtract Example C code f = (g + h) – (i + j);
Assembly code translation with only add and sub instructions:
24
Subtract Example C code f = (g + h) – (i + j);
translates into the following assembly code:
  add t0, g, h          add f, g, h
  add t1, i, j    or    sub f, f, i
  sub f, t0, t1         sub f, f, j
Each version may produce a different result because floating-point operations are not necessarily associative and commutative… more on this later
25
Operands In C, each “variable” is a location in memory
In hardware, each memory access is expensive – if variable a is accessed repeatedly, it helps to bring the variable into an on-chip scratchpad and operate on the scratchpad (registers) To simplify the instructions, we require that each instruction (add, sub) only operate on registers Note: the number of operands (variables) in a C program is very large; the number of operands in assembly is fixed… there can be only so many scratchpad registers
26
Registers The MIPS ISA has 32 registers (x86 has 8 registers) –
Why not more? Why not less? Each register is 32-bit wide (modern 64-bit architectures have 64-bit wide registers) A 32-bit entity (4 bytes) is referred to as a word To make the code more readable, registers are partitioned as $s0-$s7 (C/Java variables), $t0-$t9 (temporary variables)…
27
Memory Operands
Values must be fetched from memory before (add and sub) instructions can operate on them
Load word:  lw $t0, memory-address
Store word: sw $t0, memory-address
How is memory-address determined?
[Figure: loads copy a value from memory into a register; stores copy a value from a register into memory]
28
… Memory Address The compiler organizes data in memory… it knows the
location of every variable (saved in a table)… it can fill in the appropriate mem-address for load-store instructions
[Figure: memory layout for the declaration int a, b, c, d[10], placed consecutively starting at a base address]
29
Immediate Operands An instruction may require a constant as input
An immediate instruction uses a constant number as one of the inputs (instead of a register operand)
addi $s0, $zero, 1000   # the program has base address 1000 and this is
                        # saved in $s0; $zero is a register that always
                        # equals zero
addi $s1, $s0, 0        # this is the address of variable a
addi $s2, $s0, 4        # this is the address of variable b
addi $s3, $s0, 8        # this is the address of variable c
addi $s4, $s0, 12       # this is the address of variable d[0]
30
Memory Instruction Format
The format of a load instruction:
lw $t0, 8($t3)
 destination register: $t0 (any register)
 source address: the constant 8 added to the register in brackets ($t3)
31
Example Convert to assembly: C code: d[3] = d[2] + a;
32
Example Convert to assembly: C code: d[3] = d[2] + a;
Assembly:
# addi instructions as before
lw   $t0, 8($s4)     # d[2] is brought into $t0
lw   $t1, 0($s1)     # a is brought into $t1
add  $t0, $t0, $t1   # the sum is in $t0
sw   $t0, 12($s4)    # $t0 is stored into d[3]
Assembly version of the code continues to expand!
33
Recap – Numeric Representations
Decimal 35ten = 3 x 10^1 + 5 x 10^0
Binary 100011two = 1 x 2^5 + 1 x 2^1 + 1 x 2^0
Hexadecimal (compact representation) 0x23 or 23hex = 2 x 16^1 + 3 x 16^0
0-15 (decimal) map to 0-9, a-f (hex)
34
Instruction Formats
Instructions are represented as 32-bit numbers (one word), broken into 6 fields
R-type instruction   add $t0, $s1, $s2
 6 bits   5 bits   5 bits   5 bits   5 bits     6 bits
 op       rs       rt       rd       shamt      funct
 opcode   source   source   dest     shift amt  function
I-type instruction   lw $t0, 32($s3)
 6 bits   5 bits   5 bits   16 bits
 opcode   rs       rt       constant
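As an illustration of how the R-type fields pack into one 32-bit word, here is a minimal C sketch that encodes add $t0, $s1, $s2 using the standard MIPS field positions and register numbers ($s1 = 17, $s2 = 18, $t0 = 8); a sketch for intuition, not a full assembler.

#include <stdint.h>
#include <stdio.h>

/* Pack an R-type MIPS instruction: op | rs | rt | rd | shamt | funct */
static uint32_t encode_rtype(uint32_t op, uint32_t rs, uint32_t rt,
                             uint32_t rd, uint32_t shamt, uint32_t funct) {
    return (op << 26) | (rs << 21) | (rt << 16) |
           (rd << 11) | (shamt << 6) | funct;
}

int main(void) {
    /* add $t0, $s1, $s2  ->  op 0, rs $s1(17), rt $s2(18), rd $t0(8), funct 0x20 */
    uint32_t word = encode_rtype(0, 17, 18, 8, 0, 0x20);
    printf("0x%08x\n", word);   /* prints 0x02324020 */
    return 0;
}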
35
Logical Operations Logical ops C operators Java operators MIPS instr
Shift left        <<   <<    sll
Shift right       >>   >>>   srl
Bit-by-bit AND    &    &     and, andi
Bit-by-bit OR     |    |     or, ori
Bit-by-bit NOT    ~    ~     nor
36
Control Instructions Conditional branch: Jump to instruction L1 if register1 equals register2: beq register1, register2, L1 Similarly, bne and slt (set-on-less-than) Unconditional branch: j L1 jr $s0 Convert to assembly: if (i == j) f = g+h; else f = g-h;
37
Control Instructions
Conditional branch: jump to instruction L1 if register1 equals register2:  beq register1, register2, L1
Similarly, bne and slt (set-on-less-than)
Unconditional branch:  j L1   or   jr $s0
Convert to assembly:
  if (i == j)          bne $s3, $s4, Else
     f = g+h;          add $s0, $s1, $s2
  else                 j   Exit
     f = g-h;    Else: sub $s0, $s1, $s2
                 Exit:
38
Example Convert to assembly: while (save[i] == k) i += 1;
i and k are in $s3 and $s5 and base of array save[] is in $s6
39
Example Convert to assembly: while (save[i] == k)
i and k are in $s3 and $s5 and the base of array save[] is in $s6
Loop: sll  $t1, $s3, 2
      add  $t1, $t1, $s6
      lw   $t0, 0($t1)
      bne  $t0, $s5, Exit
      addi $s3, $s3, 1
      j    Loop
Exit:
40
Lecture 3: MIPS Instruction Set
More MIPS instructions Procedure call/return
41
Memory Operands
Values must be fetched from memory before (add and sub) instructions can operate on them
Load word:  lw $t0, memory-address
Store word: sw $t0, memory-address
How is memory-address determined?
[Figure: loads copy a value from memory into a register; stores copy a value from a register into memory]
42
… Memory Address The compiler organizes data in memory… it knows the
location of every variable (saved in a table)… it can fill in the appropriate mem-address for load-store instructions
[Figure: memory layout for the declaration int a, b, c, d[10], placed consecutively starting at a base address]
43
Immediate Operands An instruction may require a constant as input
An immediate instruction uses a constant number as one of the inputs (instead of a register operand)
addi $s0, $zero, 1000   # the program has base address 1000 and this is
                        # saved in $s0; $zero is a register that always
                        # equals zero
addi $s1, $s0, 0        # this is the address of variable a
addi $s2, $s0, 4        # this is the address of variable b
addi $s3, $s0, 8        # this is the address of variable c
addi $s4, $s0, 12       # this is the address of variable d[0]
44
Memory Instruction Format
The format of a load instruction:
lw $t0, 8($t3)
 destination register: $t0 (any register)
 source address: the constant 8 added to the register in brackets ($t3)
45
Example Convert to assembly: C code: d[3] = d[2] + a;
Assembly:
# addi instructions as before
lw   $t0, 8($s4)     # d[2] is brought into $t0
lw   $t1, 0($s1)     # a is brought into $t1
add  $t0, $t0, $t1   # the sum is in $t0
sw   $t0, 12($s4)    # $t0 is stored into d[3]
Assembly version of the code continues to expand!
46
Recap – Numeric Representations
Decimal 35ten = 3 x 10^1 + 5 x 10^0
Binary 100011two = 1 x 2^5 + 1 x 2^1 + 1 x 2^0
Hexadecimal (compact representation) 0x23 or 23hex = 2 x 16^1 + 3 x 16^0
0-15 (decimal) map to 0-9, a-f (hex)
[Table: decimal values 0-15 with their 4-bit binary and hex equivalents, e.g. 10ten = 1010two = a hex, 15ten = 1111two = f hex]
47
Instruction Formats
Instructions are represented as 32-bit numbers (one word), broken into 6 fields
R-type instruction   add $t0, $s1, $s2
 6 bits   5 bits   5 bits   5 bits   5 bits     6 bits
 op       rs       rt       rd       shamt      funct
 opcode   source   source   dest     shift amt  function
I-type instruction   lw $t0, 32($s3)
 6 bits   5 bits   5 bits   16 bits
 opcode   rs       rt       constant
48
Logical Operations Logical ops C operators Java operators MIPS instr
Shift left        <<   <<    sll
Shift right       >>   >>>   srl
Bit-by-bit AND    &    &     and, andi
Bit-by-bit OR     |    |     or, ori
Bit-by-bit NOT    ~    ~     nor
49
Control Instructions Conditional branch: Jump to instruction L1 if register1 equals register2: beq register1, register2, L1 Similarly, bne and slt (set-on-less-than) Unconditional branch: j L1 jr $s0 (useful for large case statements and big jumps) Convert to assembly: if (i == j) f = g+h; else f = g-h;
50
Control Instructions
Conditional branch: jump to instruction L1 if register1 equals register2:  beq register1, register2, L1
Similarly, bne and slt (set-on-less-than)
Unconditional branch:  j L1   or   jr $s0  (useful for large case statements and big jumps)
Convert to assembly:
  if (i == j)          bne $s3, $s4, Else
     f = g+h;          add $s0, $s1, $s2
  else                 j   Exit
     f = g-h;    Else: sub $s0, $s1, $s2
                 Exit:
51
Example Convert to assembly: while (save[i] == k) i += 1;
i and k are in $s3 and $s5 and base of array save[] is in $s6
52
Example Convert to assembly: while (save[i] == k)
i and k are in $s3 and $s5 and the base of array save[] is in $s6
Loop: sll  $t1, $s3, 2
      add  $t1, $t1, $s6
      lw   $t0, 0($t1)
      bne  $t0, $s5, Exit
      addi $s3, $s3, 1
      j    Loop
Exit:
53
Procedures
Each procedure (function, subroutine) maintains a scratchpad of register values – when another procedure is called (the callee), the new procedure takes over the scratchpad – values may have to be saved so we can safely return to the caller
Steps in a procedure call:
 parameters (arguments) are placed where the callee can see them
 control is transferred to the callee
 acquire storage resources for the callee
 execute the procedure
 place the result value where the caller can access it
 return control to the caller
54
Registers The 32 MIPS registers are partitioned as follows:
Register 0:      $zero    always stores the constant 0
Registers 2-3:   $v0, $v1 return values of a procedure
Registers 4-7:   $a0-$a3  input arguments to a procedure
Registers 8-15:  $t0-$t7  temporaries
Registers 16-23: $s0-$s7  variables
Registers 24-25: $t8-$t9  more temporaries
Register 28:     $gp      global pointer
Register 29:     $sp      stack pointer
Register 30:     $fp      frame pointer
Register 31:     $ra      return address
55
Jump-and-Link A special register (storage not part of the register file) maintains the address of the instruction currently being executed – this is the program counter (PC) The procedure call is executed by invoking the jump-and-link (jal) instruction – the current PC (actually, PC+4) is saved in the register $ra and we jump to the procedure’s address (the PC is accordingly set to this address) jal NewProcedureAddress Since jal may over-write a relevant value in $ra, it must be saved somewhere (in memory?) before invoking the jal instruction How do we return control back to the caller after completing the callee procedure?
56
… The Stack The register scratchpad for a procedure seems volatile –
it seems to disappear every time we switch procedures – a procedure's values are therefore backed up in memory on a stack
[Figure: Proc A calls Proc B, which calls Proc C and then returns; Proc A's, Proc B's, and Proc C's values are stacked in memory from high addresses toward low addresses – the stack grows toward lower addresses]
57
Storage Management on a Call/Return
A new procedure must create space for all its variables on the stack
Before executing the jal, the caller must save relevant values ($s0-$s7, $a0-$a3, $ra, temps) into its own stack space
Arguments are copied into $a0-$a3; the jal is executed
After the callee creates stack space, it updates the value of $sp
Once the callee finishes, it copies the return value into $v0, frees up stack space, and $sp is incremented
On return, the caller may bring its stack values, $ra, and temps back into registers
The responsibility for copies between stack and registers may fall upon either the caller or the callee
58
Example 1 int leaf_example (int g, int h, int i, int j) { int f ;
f = (g + h) - (i + j); return f; }
59
Example 1 int leaf_example (int g, int h, int i, int j) leaf_example:
{ int f ;
  f = (g + h) - (i + j);
  return f; }
leaf_example: addi $sp, $sp, -12
              sw   $t1, 8($sp)
              sw   $t0, 4($sp)
              sw   $s0, 0($sp)
              add  $t0, $a0, $a1
              add  $t1, $a2, $a3
              sub  $s0, $t0, $t1
              add  $v0, $s0, $zero
              lw   $s0, 0($sp)
              lw   $t0, 4($sp)
              lw   $t1, 8($sp)
              addi $sp, $sp, 12
              jr   $ra
Notes: In this example, the procedure's stack space was used for the caller's variables, not the callee's – the compiler decided that was better. The caller took care of saving its $ra and $a0-$a3.
60
Example 2 int fact (int n) { if (n < 1) return (1);
else return (n * fact(n-1)); }
61
Example 2 int fact (int n) { if (n < 1) return (1);
else return (n * fact(n-1)); }
fact: addi $sp, $sp, -8
      sw   $ra, 4($sp)
      sw   $a0, 0($sp)
      slti $t0, $a0, 1
      beq  $t0, $zero, L1
      addi $v0, $zero, 1
      addi $sp, $sp, 8
      jr   $ra
L1:   addi $a0, $a0, -1
      jal  fact
      lw   $a0, 0($sp)
      lw   $ra, 4($sp)
      addi $sp, $sp, 8
      mul  $v0, $a0, $v0
      jr   $ra
Notes: The caller saves $a0 and $ra in its stack space. Temps are never saved.
62
Memory Organization
The space allocated on stack by a procedure is termed the activation record (includes saved values and data local to the procedure) – the frame pointer points to the start of the record and the stack pointer points to the end – variable addresses are specified relative to $fp as $sp may change during the execution of the procedure
$gp points to the area in memory that saves global variables
Dynamically allocated storage (with malloc()) is placed on the heap
[Figure: memory layout from high to low addresses – stack, dynamic data (heap), static data (globals), text (instructions)]
63
Lecture 4: Procedure Calls
Large constants The compilation process
64
Recap The jal instruction is used to jump to the procedure and
save the current PC (+4) into the return address register Arguments are passed in $a0-$a3; return values in $v0-$v1 Since the callee may over-write the caller’s registers, relevant values may have to be copied into memory Each procedure may also require memory space for local variables – a stack is used to organize the memory needs for each procedure
65
… The Stack The register scratchpad for a procedure seems volatile –
it seems to disappear every time we switch procedures – a procedure's values are therefore backed up in memory on a stack
[Figure: Proc A calls Proc B, which calls Proc C and then returns; Proc A's, Proc B's, and Proc C's values are stacked in memory from high addresses toward low addresses – the stack grows toward lower addresses]
66
Example 1 int leaf_example (int g, int h, int i, int j) { int f ;
f = (g + h) - (i + j); return f; }
67
Example 1 int leaf_example (int g, int h, int i, int j) leaf_example:
{ int f ;
  f = (g + h) - (i + j);
  return f; }
leaf_example: addi $sp, $sp, -12
              sw   $t1, 8($sp)
              sw   $t0, 4($sp)
              sw   $s0, 0($sp)
              add  $t0, $a0, $a1
              add  $t1, $a2, $a3
              sub  $s0, $t0, $t1
              add  $v0, $s0, $zero
              lw   $s0, 0($sp)
              lw   $t0, 4($sp)
              lw   $t1, 8($sp)
              addi $sp, $sp, 12
              jr   $ra
Notes: In this example, the procedure's stack space was used for the caller's variables, not the callee's – the compiler decided that was better. The caller took care of saving its $ra and $a0-$a3.
68
Example 2 int fact (int n) { if (n < 1) return (1);
else return (n * fact(n-1)); }
69
Example 2 int fact (int n) { if (n < 1) return (1);
else return (n * fact(n-1)); }
fact: addi $sp, $sp, -8
      sw   $ra, 4($sp)
      sw   $a0, 0($sp)
      slti $t0, $a0, 1
      beq  $t0, $zero, L1
      addi $v0, $zero, 1
      addi $sp, $sp, 8
      jr   $ra
L1:   addi $a0, $a0, -1
      jal  fact
      lw   $a0, 0($sp)
      lw   $ra, 4($sp)
      addi $sp, $sp, 8
      mul  $v0, $a0, $v0
      jr   $ra
Notes: The caller saves $a0 and $ra in its stack space. Temps are never saved.
70
Memory Organization
The space allocated on stack by a procedure is termed the activation record (includes saved values and data local to the procedure) – the frame pointer points to the start of the record and the stack pointer points to the end – variable addresses are specified relative to $fp as $sp may change during the execution of the procedure
$gp points to the area in memory that saves global variables
Dynamically allocated storage (with malloc()) is placed on the heap
[Figure: memory layout from high to low addresses – stack, dynamic data (heap), static data (globals), text (instructions)]
71
Dealing with Characters
Instructions are also provided to deal with byte-sized and half-word quantities: lb (load-byte), sb, lh, sh These data types are most useful when dealing with characters, pixel values, etc. C employs ASCII formats to represent characters – each character is represented with 8 bits and a string ends in the null character (corresponding to the 8-bit number 0)
72
Example Convert to assembly: void strcpy (char x[], char y[]) { int i;
while ((x[i] = y[i]) != `\0’) i += 1; }
73
Example Convert to assembly: strcpy: void strcpy (char x[], char y[])
{ int i;
  i = 0;
  while ((x[i] = y[i]) != '\0')
    i += 1; }
strcpy: addi $sp, $sp, -4
        sw   $s0, 0($sp)
        add  $s0, $zero, $zero
L1:     add  $t1, $s0, $a1
        lb   $t2, 0($t1)
        add  $t3, $s0, $a0
        sb   $t2, 0($t3)
        beq  $t2, $zero, L2
        addi $s0, $s0, 1
        j    L1
L2:     lw   $s0, 0($sp)
        addi $sp, $sp, 4
        jr   $ra
74
Large Constants Immediate instructions can only specify 16-bit constants The lui instruction is used to store a 16-bit constant into the upper 16 bits of a register… thus, two immediate instructions are used to specify a 32-bit constant The destination PC-address in a conditional branch is specified as a 16-bit constant, relative to the current PC A jump (j) instruction can specify a 26-bit constant; if more bits are required, the jump-register (jr) instruction is used
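As a rough illustration of the lui/ori idea in C: split a 32-bit constant into its upper and lower 16-bit halves and reassemble it. The constant below is an arbitrary example value, not taken from the lecture.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t value = 0x0040ABCD;                 /* arbitrary 32-bit constant */
    uint16_t upper = (uint16_t)(value >> 16);    /* what lui would load */
    uint16_t lower = (uint16_t)(value & 0xFFFF); /* what ori would OR in */

    /* lui reg, upper  places 16 bits in the top half of the register;
       ori reg, reg, lower  fills in the bottom half */
    uint32_t rebuilt = ((uint32_t)upper << 16) | lower;
    printf("%d\n", rebuilt == value);            /* prints 1 */
    return 0;
}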
75
Starting a Program
C program (x.c) → Compiler → assembly language program (x.s) → Assembler → object: machine language module (x.o), together with object library routines (x.a, x.so) → Linker → executable: machine language program (a.out) → Loader → Memory
76
Role of Assembler Convert pseudo-instructions into actual hardware
instructions – pseudo-instrs make it easier to program in assembly – examples: “move”, “blt”, 32-bit immediate operands, etc. Convert assembly instrs into machine instrs – a separate object file (x.o) is created for each C file (x.c) – compute the actual values for instruction labels – maintain info on external references and debugging information
77
Role of Linker
Stitches different object files into a single executable:
 patch internal and external references
 determine addresses of data and instruction labels
 organize code and data modules in memory
Some libraries (DLLs) are dynamically linked – the executable points to dummy routines – these dummy routines call the dynamic linker-loader so they can update the executable to jump to the correct routine
78
Full Example – Sort in C void sort (int v[], int n) { int i, j;
for (i=0; i<n; i+=1) {
   for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) {
     swap (v,j);
   }
 }
}
void swap (int v[], int k)
{ int temp;
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp; }
Allocate registers to program variables
Produce code for the program body
Preserve registers across procedure invocations
79
The swap Procedure
Register allocation: $a0 and $a1 for the two arguments, $t0 for the temp variable – no need for saves and restores as we're not using $s0-$s7 and this is a leaf procedure (won't need to re-use $a0 and $a1)
swap: sll $t1, $a1, 2
      add $t1, $a0, $t1
      lw  $t0, 0($t1)
      lw  $t2, 4($t1)
      sw  $t2, 0($t1)
      sw  $t0, 4($t1)
      jr  $ra
80
The sort Procedure
Register allocation: arguments v and n use $a0 and $a1, i and j use $s0 and $s1; must save $a0 and $a1 before calling the leaf procedure
The outer for loop looks like this (note the use of pseudo-instrs):
           move $s0, $zero        # initialize the loop
loopbody1: bge  $s0, $a1, exit1   # will eventually use slt and beq
           … body of inner loop …
           addi $s0, $s0, 1
           j    loopbody1
exit1:
for (i=0; i<n; i+=1) { for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) { swap (v,j); } } }
81
The sort Procedure The inner for loop looks like this:
           addi $s1, $s0, -1       # initialize the loop
loopbody2: blt  $s1, $zero, exit2  # will eventually use slt and beq
           sll  $t1, $s1, 2
           add  $t2, $a0, $t1
           lw   $t3, 0($t2)
           lw   $t4, 4($t2)
           bge  $t4, $t3, exit2
           … body of inner loop …
           addi $s1, $s1, -1
           j    loopbody2
exit2:
for (i=0; i<n; i+=1) { for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) { swap (v,j); } } }
82
Saves and Restores Since we repeatedly call “swap” with $a0 and $a1, we begin “sort” by copying its arguments into $s2 and $s3 – must update the rest of the code in “sort” to use $s2 and $s3 instead of $a0 and $a1 Must save $ra at the start of “sort” because it will get over-written when we call “swap” Must also save $s0-$s3 so we don’t overwrite something that belongs to the procedure that called “sort”
83
Saves and Restores sort: addi $sp, $sp, -20 sw $ra, 16($sp)
       sw   $s3, 12($sp)
       sw   $s2, 8($sp)
       sw   $s1, 4($sp)
       sw   $s0, 0($sp)
       move $s2, $a0
       move $s3, $a1
       …
       move $a0, $s2      # the inner loop body starts here
       move $a1, $s1
       jal  swap
       …
exit1: lw   $s0, 0($sp)
       lw   $s1, 4($sp)
       lw   $s2, 8($sp)
       lw   $s3, 12($sp)
       lw   $ra, 16($sp)
       addi $sp, $sp, 20
       jr   $ra
9 lines of C code → 35 lines of assembly
84
Relative Performance
[Table: relative performance, cycle count, instruction count, and CPI for gcc optimization levels none, O1, O2, and O3]
A Java interpreter has a relative performance of 0.12, while the Java just-in-time compiler has a relative performance of 2.13
Note that the quicksort algorithm is about three orders of magnitude faster than the bubble sort algorithm (for 100K elements)
85
Lecture 5: MIPS Examples
Today’s topics: the compilation process full example – sort in C Reminder: 2nd assignment will be posted later today
86
Dealing with Characters
Instructions are also provided to deal with byte-sized and half-word quantities: lb (load-byte), sb, lh, sh These data types are most useful when dealing with characters, pixel values, etc. C employs ASCII formats to represent characters – each character is represented with 8 bits and a string ends in the null character (corresponding to the 8-bit number 0)
87
Example Convert to assembly: void strcpy (char x[], char y[]) { int i;
while ((x[i] = y[i]) != `\0’) i += 1; }
88
Example Convert to assembly: strcpy: void strcpy (char x[], char y[])
{ int i;
  i = 0;
  while ((x[i] = y[i]) != '\0')
    i += 1; }
strcpy: addi $sp, $sp, -4
        sw   $s0, 0($sp)
        add  $s0, $zero, $zero
L1:     add  $t1, $s0, $a1
        lb   $t2, 0($t1)
        add  $t3, $s0, $a0
        sb   $t2, 0($t3)
        beq  $t2, $zero, L2
        addi $s0, $s0, 1
        j    L1
L2:     lw   $s0, 0($sp)
        addi $sp, $sp, 4
        jr   $ra
89
Large Constants Immediate instructions can only specify 16-bit constants The lui instruction is used to store a 16-bit constant into the upper 16 bits of a register… thus, two immediate instructions are used to specify a 32-bit constant The destination PC-address in a conditional branch is specified as a 16-bit constant, relative to the current PC A jump (j) instruction can specify a 26-bit constant; if more bits are required, the jump-register (jr) instruction is used
90
Starting a Program
C program (x.c) → Compiler → assembly language program (x.s) → Assembler → object: machine language module (x.o), together with object library routines (x.a, x.so) → Linker → executable: machine language program (a.out) → Loader → Memory
91
Role of Assembler Convert pseudo-instructions into actual hardware
instructions – pseudo-instrs make it easier to program in assembly – examples: “move”, “blt”, 32-bit immediate operands, etc. Convert assembly instrs into machine instrs – a separate object file (x.o) is created for each C file (x.c) – compute the actual values for instruction labels – maintain info on external references and debugging information
92
Role of Linker
Stitches different object files into a single executable:
 patch internal and external references
 determine addresses of data and instruction labels
 organize code and data modules in memory
Some libraries (DLLs) are dynamically linked – the executable points to dummy routines – these dummy routines call the dynamic linker-loader so they can update the executable to jump to the correct routine
93
Full Example – Sort in C void sort (int v[], int n) { int i, j;
for (i=0; i<n; i+=1) {
   for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) {
     swap (v,j);
   }
 }
}
void swap (int v[], int k)
{ int temp;
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp; }
Allocate registers to program variables
Produce code for the program body
Preserve registers across procedure invocations
94
The swap Procedure void swap (int v[], int k) { int temp; temp = v[k];
v[k] = v[k+1]; v[k+1] = temp; } Allocate registers to program variables Produce code for the program body Preserve registers across procedure invocations
95
The swap Procedure
Register allocation: $a0 and $a1 for the two arguments, $t0 for the temp variable – no need for saves and restores as we're not using $s0-$s7 and this is a leaf procedure (won't need to re-use $a0 and $a1)
swap: sll $t1, $a1, 2
      add $t1, $a0, $t1
      lw  $t0, 0($t1)
      lw  $t2, 4($t1)
      sw  $t2, 0($t1)
      sw  $t0, 4($t1)
      jr  $ra
void swap (int v[], int k)
{ int temp;
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp; }
96
The sort Procedure Register allocation: arguments v and n use $a0 and $a1, i and j use $s0 and $s1 for (i=0; i<n; i+=1) { for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) { swap (v,j); }
97
The sort Procedure
Register allocation: arguments v and n use $a0 and $a1, i and j use $s0 and $s1; must save $a0, $a1, and $ra before calling the leaf procedure
The outer for loop looks like this (note the use of pseudo-instrs):
           move $s0, $zero        # initialize the loop
loopbody1: bge  $s0, $a1, exit1   # will eventually use slt and beq
           … body of inner loop …
           addi $s0, $s0, 1
           j    loopbody1
exit1:
for (i=0; i<n; i+=1) { for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) { swap (v,j); } } }
98
The sort Procedure The inner for loop looks like this:
           addi $s1, $s0, -1       # initialize the loop
loopbody2: blt  $s1, $zero, exit2  # will eventually use slt and beq
           sll  $t1, $s1, 2
           add  $t2, $a0, $t1
           lw   $t3, 0($t2)
           lw   $t4, 4($t2)
           bge  $t4, $t3, exit2
           … body of inner loop …
           addi $s1, $s1, -1
           j    loopbody2
exit2:
for (i=0; i<n; i+=1) { for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) { swap (v,j); } } }
99
Saves and Restores Since we repeatedly call “swap” with $a0 and $a1, we begin “sort” by copying its arguments into $s2 and $s3 – must update the rest of the code in “sort” to use $s2 and $s3 instead of $a0 and $a1 Must save $ra at the start of “sort” because it will get over-written when we call “swap” Must also save $s0-$s3 so we don’t overwrite something that belongs to the procedure that called “sort”
100
Saves and Restores sort: addi $sp, $sp, -20 sw $ra, 16($sp)
       sw   $s3, 12($sp)
       sw   $s2, 8($sp)
       sw   $s1, 4($sp)
       sw   $s0, 0($sp)
       move $s2, $a0
       move $s3, $a1
       …
       move $a0, $s2      # the inner loop body starts here
       move $a1, $s1
       jal  swap
       …
exit1: lw   $s0, 0($sp)
       lw   $s1, 4($sp)
       lw   $s2, 8($sp)
       lw   $s3, 12($sp)
       lw   $ra, 16($sp)
       addi $sp, $sp, 20
       jr   $ra
9 lines of C code → 35 lines of assembly
for (i=0; i<n; i+=1) { for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) { swap (v,j); } } }
101
Relative Performance
[Table: relative performance, cycle count, instruction count, and CPI for gcc optimization levels none, O1, O2, and O3]
A Java interpreter has a relative performance of 0.12, while the Java just-in-time compiler has a relative performance of 2.13
Note that the quicksort algorithm is about three orders of magnitude faster than the bubble sort algorithm (for 100K elements)
102
IA-32 Instruction Set Intel’s IA-32 instruction set has evolved over 20 years – old features are preserved for software compatibility Numerous complex instructions – complicates hardware design (Complex Instruction Set Computer – CISC) Instructions have different sizes, operands can be in registers or memory, only 8 general-purpose registers, one of the operands is over-written RISC instructions are more amenable to high performance (clock speed and parallelism) – modern Intel processors convert IA-32 instructions into simpler micro-operations
103
Lecture 6: Compilers, the SPIM Simulator
Today’s topics: SPIM simulator The compilation process Additional TA hours: Liqun Cheng, legion at cs, Office: MEB 2162 Office hours: Mon/Wed 11-12 TA hours for Josh: Wed 11:45-12:45 (EMCB 130) TA hours for Devyani: Wed 11:45-12:45 (MEB 3431)
104
IA-32 Instruction Set Intel’s IA-32 instruction set has evolved over 20 years – old features are preserved for software compatibility Numerous complex instructions – complicates hardware design (Complex Instruction Set Computer – CISC) Instructions have different sizes, operands can be in registers or memory, only 8 general-purpose registers, one of the operands is over-written RISC instructions are more amenable to high performance (clock speed and parallelism) – modern Intel processors convert IA-32 instructions into simpler micro-operations
105
SPIM SPIM is a simulator that reads in an assembly program
and models its behavior on a MIPS processor Note that a “MIPS add instruction” will eventually be converted to an add instruction for the host computer’s architecture – this translation happens under the hood To simplify the programmer’s task, it accepts pseudo-instructions, large constants, constants in decimal/hex formats, labels, etc. The simulator allows us to inspect register/memory values to confirm that our program is behaving correctly
106
Example This simple program (similar to what we’ve written in class) will run on SPIM (a “main” label is introduced so SPIM knows where to start) main: addi $t0, $zero, 5 addi $t1, $zero, 7 add $t2, $t0, $t1 If we inspect the contents of $t2, we’ll find the number 12
107
User Interface
rajeev@trust > spim
(spim) read "add.s"
(spim) run
(spim) print $10
Reg 10 = 0x0000000c (12)
(spim) reinitialize
(spim) step
(spim) print $8
Reg 8 = 0x00000005 (5)
(spim) print $9
Reg 9 = 0x00000000 (0)
Reg 9 = 0x00000007 (7)
(spim) exit
File add.s:
main: addi $t0, $zero, 5
      addi $t1, $zero, 7
      add  $t2, $t0, $t1
108
Directives
File add.s:
.text
.globl main
main: addi $t0, $zero, 5
      addi $t1, $zero, 7
      add  $t2, $t0, $t1
      …
      jal  swap_proc
      jr   $ra
.globl swap_proc        # this function is visible to other files
swap_proc: …
[Figure: memory layout – stack, dynamic data (heap), static data (globals), text (instructions)]
109
Directives
File add.s:
.data
.word  5
.word  7
.byte  25
.asciiz "the answer is"
.text
.globl main
main: lw  $t0, 0($gp)
      lw  $t1, 4($gp)
      add $t2, $t0, $t1
      …
      jal swap_proc
      jr  $ra
[Figure: memory layout – stack, dynamic data (heap), static data (globals), text (instructions)]
110
Labels
File add.s:
.data
in1: .word  5
in2: .word  7
c1:  .byte  25
str: .asciiz "the answer is"
.text
.globl main
main: lw  $t0, in1
      lw  $t1, in2
      add $t2, $t0, $t1
      …
      jal swap_proc
      jr  $ra
[Figure: memory layout – stack, dynamic data (heap), static data (globals), text (instructions)]
111
Endian-ness
Two major formats for transferring values between registers and memory
Memory: a word is stored as four bytes from a low address to a high address
Little-endian register: the first byte read goes in the low (least-significant) end of the register
Big-endian register: the first byte read goes in the big (most-significant) end of the register
[Figure: the same four bytes of memory loaded into a register in little-endian and big-endian byte order]
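A short C sketch that makes the byte-order difference visible: store a 32-bit value and inspect the byte at the lowest address. Which half of the value appears there depends on the machine this is run on.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t word = 0x11223344;
    uint8_t *bytes = (uint8_t *)&word;   /* view the same 4 bytes one at a time */

    /* Little-endian machines put 0x44 at the lowest address,
       big-endian machines put 0x11 there. */
    printf("byte at lowest address: 0x%02x\n", bytes[0]);
    printf("%s-endian\n", bytes[0] == 0x44 ? "little" : "big");
    return 0;
}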
112
System Calls SPIM provides some OS services: most useful are
operations for I/O: read, write, file open, file close The arguments for the syscall are placed in $a0-$a3 The type of syscall is identified by placing the appropriate number in $v0 – 1 for print_int, 4 for print_string, 5 for read_int, etc. $v0 is also used for the syscall’s return value
113
Example Print Routine
.data
str: .asciiz "the answer is "
.text
li $v0, 4       # load immediate; 4 is the code for print_string
la $a0, str     # the print_string syscall expects the string
                # address as the argument; la is the instruction
                # to load the address of the operand (str)
syscall         # SPIM will now invoke syscall-4
li $v0, 1       # syscall-1 corresponds to print_int
li $a0, 5       # print_int expects the integer as its argument
syscall         # SPIM will now invoke syscall-1
114
Example Write an assembly program to prompt the user for two numbers and print the sum of the two numbers
115
Example
.data
str1: .asciiz "Enter 2 numbers:"
str2: .asciiz "The sum is "
.text
.globl main
main: li  $v0, 4          # print the prompt string
      la  $a0, str1
      syscall
      li  $v0, 5          # read the first integer into $v0
      syscall
      add $t0, $v0, $zero
      li  $v0, 5          # read the second integer into $v0
      syscall
      add $t1, $v0, $zero
      li  $v0, 4          # print the answer string
      la  $a0, str2
      syscall
      add $a0, $t1, $t0   # print the sum
      li  $v0, 1
      syscall
116
Compilation Steps
The front-end deals mostly with language-specific actions:
 Scanning: reads characters and breaks them into tokens
 Parsing: checks syntax
 Semantic analysis: makes sure operations/types are meaningful
 Intermediate representation (IR): simple instructions, infinite registers, makes few assumptions about hw
The back-end handles optimizations and code generation:
 Local optimizations: within a basic block
 Global optimizations: across basic blocks
 Register allocation
117
Dataflow Control flow graph: each box represents a basic block and
arcs represent potential jumps between instructions For each block, the compiler computes values that were defined (written to) and used (read from) Such dataflow analysis is key to several optimizations: for example, moving code around, eliminating dead code, removing redundant computations, etc.
118
Register Allocation The IR contains infinite virtual registers – these must be mapped to the architecture’s finite set of registers (say, 32 registers) For each virtual register, its live range is computed (the range between which the register is defined and used) We must now assign one of 32 colors to each virtual register so that intersecting live ranges are colored differently – can be mapped to the famous graph coloring problem If this is not possible, some values will have to be temporarily spilled to memory and restored (this is equivalent to breaking a single live range into smaller live ranges)
119
High-Level Optimizations
High-level optimizations are usually hardware independent Procedure inlining Loop unrolling Loop interchange, blocking (more on this later when we study cache/memory organization)
120
Low-Level Optimizations
Common sub-expression elimination Constant propagation Copy propagation Dead store/code elimination Code motion Induction variable elimination Strength reduction Pipeline scheduling
121
Lecture 7: Computer Arithmetic
Chapter 3, Lecture 7: Computer Arithmetic
Chapter 2 wrap-up, numerical representations, addition and subtraction
122
Compilation Steps
The front-end deals mostly with language-specific actions:
 Scanning: reads characters and breaks them into tokens
 Parsing: checks syntax
 Semantic analysis: makes sure operations/types are meaningful
 Intermediate representation (IR): simple instructions, infinite registers, makes few assumptions about hw
The back-end handles optimizations and code generation:
 Local optimizations: within a basic block
 Global optimizations: across basic blocks
 Register allocation
123
Dataflow Control flow graph: each box represents a basic block and
arcs represent potential jumps between instructions For each block, the compiler computes values that were defined (written to) and used (read from) Such dataflow analysis is key to several optimizations: for example, moving code around, eliminating dead code, removing redundant computations, etc.
124
Register Allocation The IR contains infinite virtual registers – these must be mapped to the architecture’s finite set of registers (say, 32 registers) For each virtual register, its live range is computed (the range between which the register is defined and used) We must now assign one of 32 colors to each virtual register so that intersecting live ranges are colored differently – can be mapped to the famous graph coloring problem If this is not possible, some values will have to be temporarily spilled to memory and restored (this is equivalent to breaking a single live range into smaller live ranges)
125
Graph Coloring
[Figure: live ranges for virtual registers VR1-VR4 and the corresponding interference graph – live ranges that overlap must be assigned different colors/registers]
126
High-Level Optimizations
High-level optimizations are usually hardware independent Procedure inlining Loop unrolling Loop interchange, blocking (more on this later when we study cache/memory organization)
127
Low-Level Optimizations
Common sub-expression elimination Constant propagation Copy propagation Dead store/code elimination Code motion Induction variable elimination Strength reduction Pipeline scheduling
128
Saves on Stack Caller saved
$a0-$a3 -- old arguments must be saved before setting new arguments for the callee
$ra -- must be saved before the jal instruction over-writes this value
$t0-$t9 -- if you plan to use your temps after the return, save them; note that callees are free to use temps as they please
You need not save $s0-$s7 as the callee will take care of them
Callee saved:
$s0-$s7 -- before the callee uses such a register, it must save the old contents since the caller will usually need it on return
local variables -- space is also created on the stack for variables local to that procedure
129
Binary Representation
The binary number 0000…0001two represents the quantity 0 x 2^31 + 0 x 2^30 + 0 x 2^29 + … + 1 x 2^0
A 32-bit word can represent 2^32 numbers between 0 and 2^32 - 1 … this is known as the unsigned representation as we're assuming that numbers are always positive
(the leftmost bit is the most significant bit, the rightmost bit is the least significant bit)
130
ASCII Vs. Binary Does it make more sense to represent a decimal number
in ASCII? Hardware to implement arithmetic would be difficult What are the storage needs? How many bits does it take to represent the decimal number 1,000,000,000 in ASCII and in binary?
131
ASCII Vs. Binary Does it make more sense to represent a decimal number
in ASCII? Hardware to implement arithmetic would be difficult
What are the storage needs? How many bits does it take to represent the decimal number 1,000,000,000 in ASCII and in binary?
In binary: 30 bits (2^30 > 1 billion)
In ASCII: 10 characters, 8 bits per char = 80 bits
132
Negative Numbers
32 bits can only represent 2^32 numbers – if we wish to also represent negative numbers, we can represent 2^31 positive numbers (incl zero) and 2^31 negative numbers
0000…0000two = 0ten
0000…0001two = 1ten
…
0111…1111two = 2^31 - 1
1000…0000two = -2^31
1000…0001two = -(2^31 - 1)
1000…0010two = -(2^31 - 2)
…
1111…1110two = -2
1111…1111two = -1
133
2’s Complement Why is this representation favorable?
0000…0000two = 0ten
0000…0001two = 1ten
…
0111…1111two = 2^31 - 1
1000…0000two = -2^31
1000…0001two = -(2^31 - 1)
1000…0010two = -(2^31 - 2)
…
1111…1110two = -2
1111…1111two = -1
Why is this representation favorable?
Consider the sum of 1 and -2 …. we get -1
Consider the sum of 2 and -1 …. we get +1
This format can directly undergo addition without any conversions!
Each number represents the quantity x31 x -2^31 + x30 x 2^30 + x29 x 2^29 + … + x1 x 2^1 + x0 x 2^0
134
2's Complement
0000…0000two = 0ten,  0000…0001two = 1ten,  … ,  0111…1111two = 2^31 - 1
1000…0000two = -2^31,  1000…0001two = -(2^31 - 1),  … ,  1111…1111two = -1
Note that the sum of a number x and its inverted representation x' always equals a string of 1s (-1).
x + x' = -1
x' + 1 = -x … hence, we can compute the negative of a number (-x = x' + 1) by inverting all bits and adding 1
Similarly, the sum of x and -x gives us all zeroes, with a carry out of 1
In reality, x + (-x) = 2^n … hence the name 2's complement
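The invert-and-add-one rule can be checked directly with a small C sketch (assuming the usual 32-bit two's complement integers):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    int32_t x = 5;
    int32_t negated = ~x + 1;        /* invert all bits, then add 1 */
    printf("%d %d\n", negated, -x);  /* prints -5 -5 */

    /* x + x' is a string of all 1s, i.e. -1 in two's complement */
    printf("%d\n", x + ~x);          /* prints -1 */
    return 0;
}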
135
Example Compute the 32-bit 2’s complement representations
for the following decimal numbers: 5, -5, -6
136
Example Compute the 32-bit 2’s complement representations
for the following decimal numbers: 5, -5, -6
 5: 0000 0000 0000 0000 0000 0000 0000 0101
-5: 1111 1111 1111 1111 1111 1111 1111 1011
-6: 1111 1111 1111 1111 1111 1111 1111 1010
Given -5, verify that negating and adding 1 yields the number 5
137
Signed / Unsigned The hardware recognizes two formats:
unsigned (corresponding to the C declaration unsigned int) -- all numbers are positive, a 1 in the most significant bit just means it is a really large number signed (C declaration is signed int or just int) -- numbers can be +/- , a 1 in the MSB means the number is negative This distinction enables us to represent twice as many numbers when we’re sure that we don’t need negatives
138
MIPS Instructions Consider a comparison instruction:
slt $t0, $t1, $zero, where $t1 contains the 32-bit number 1111…01
What gets stored in $t0?
139
MIPS Instructions Consider a comparison instruction:
slt $t0, $t1, $zero, where $t1 contains the 32-bit number 1111…01
What gets stored in $t0?
The result depends on whether $t1 is a signed or unsigned number – the compiler/programmer must track this and accordingly use either slt or sltu
slt  $t0, $t1, $zero   stores 1 in $t0
sltu $t0, $t1, $zero   stores 0 in $t0
140
The Bounds Check Shortcut
Suppose we want to check if 0 <= x < y and x and y are signed numbers (stored in $a1 and $t2) The following single comparison can check both conditions sltu $t0, $a1, $t2 beq $t0, $zero, EitherConditionFails We know that $t2 begins with a 0 If $a1 begins with a 0, sltu is effectively checking the second condition If $a1 begins with a 1, we want the condition to fail and coincidentally, sltu is guaranteed to fail in this case
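The same shortcut is common in C: one unsigned comparison covers both bounds, provided y is known to be non-negative. A minimal sketch:

#include <stdio.h>

/* Returns 1 if 0 <= x < y, using a single unsigned comparison.
   If x is negative, (unsigned)x becomes a huge value and the test fails,
   exactly like the sltu shortcut. */
static int in_bounds(int x, int y) {
    return (unsigned int)x < (unsigned int)y;
}

int main(void) {
    printf("%d %d %d\n", in_bounds(3, 10), in_bounds(-1, 10), in_bounds(12, 10));
    /* prints 1 0 0 */
    return 0;
}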
141
Sign Extension Occasionally, 16-bit signed numbers must be converted
into 32-bit signed numbers – for example, when doing an add with an immediate operand
The conversion is simple: take the most significant bit and use it to fill up the additional bits on the left – known as sign extension
So 2ten goes from 0000 0000 0000 0010 to 0000 0000 0000 0000 0000 0000 0000 0010, and -2ten goes from 1111 1111 1111 1110 to 1111 1111 1111 1111 1111 1111 1111 1110
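Sign extension is also what C performs implicitly when a 16-bit signed value is widened to 32 bits. A minimal sketch:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    int16_t small_pos = 2;      /* 0000 0000 0000 0010 */
    int16_t small_neg = -2;     /* 1111 1111 1111 1110 */

    /* The widening copies the sign bit into the new upper 16 bits. */
    int32_t wide_pos = small_pos;
    int32_t wide_neg = small_neg;

    printf("0x%08x 0x%08x\n", (uint32_t)wide_pos, (uint32_t)wide_neg);
    /* prints 0x00000002 0xfffffffe */
    return 0;
}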
142
Alternative Representations
The following two (intuitive) representations were discarded because they required additional conversion steps before arithmetic could be performed on the numbers sign-and-magnitude: the most significant bit represents +/- and the remaining bits express the magnitude one’s complement: -x is represented by inverting all the bits of x Both representations above suffer from two zeroes
143
Addition and Subtraction
Addition is similar to decimal arithmetic
For subtraction, simply add the negative number – hence, subtracting A - B involves negating B's bits, adding 1, and then adding A
144
Overflows
For an unsigned number, overflow happens when the last carry (1) cannot be accommodated
For a signed number, overflow happens when the sign of the result is inconsistent with the signs of the operands:
 when the sum of two positive numbers is a negative result
 when the sum of two negative numbers is a positive result
The sum of a positive and a negative number will never overflow
MIPS allows addu and subu instructions that work with unsigned integers and never flag an overflow – to detect the overflow, other instructions will have to be executed
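A sketch of the sign-based overflow check in C; the addition itself is done with unsigned arithmetic, since signed overflow is undefined behaviour in C:

#include <stdint.h>
#include <stdio.h>

/* Returns 1 if a + b overflows 32-bit two's complement. */
static int add_overflows(int32_t a, int32_t b) {
    uint32_t sum = (uint32_t)a + (uint32_t)b;   /* wraps around, well-defined */
    /* Overflow iff both operands have the same sign and the
       result's sign differs from that common sign. */
    return ((a >= 0) == (b >= 0)) && (((int32_t)sum >= 0) != (a >= 0));
}

int main(void) {
    printf("%d\n", add_overflows(2000000000, 2000000000));   /* 1: pos + pos gave neg */
    printf("%d\n", add_overflows(2000000000, -1));            /* 0: mixed signs never overflow */
    printf("%d\n", add_overflows(-2000000000, -2000000000));  /* 1: neg + neg gave pos */
    return 0;
}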
145
Lecture 8: Binary Multiplication & Division
Today’s topics: Addition/Subtraction Multiplication Division Reminder: get started early on assignment 3
146
2’s Complement – Signed Numbers
0000…0000two = 0ten
0000…0001two = 1ten
…
0111…1111two = 2^31 - 1
1000…0000two = -2^31
1000…0001two = -(2^31 - 1)
1000…0010two = -(2^31 - 2)
…
1111…1110two = -2
1111…1111two = -1
Why is this representation favorable?
Consider the sum of 1 and -2 …. we get -1
Consider the sum of 2 and -1 …. we get +1
This format can directly undergo addition without any conversions!
Each number represents the quantity x31 x -2^31 + x30 x 2^30 + x29 x 2^29 + … + x1 x 2^1 + x0 x 2^0
147
Alternative Representations
The following two (intuitive) representations were discarded because they required additional conversion steps before arithmetic could be performed on the numbers sign-and-magnitude: the most significant bit represents +/- and the remaining bits express the magnitude one’s complement: -x is represented by inverting all the bits of x Both representations above suffer from two zeroes
148
Addition and Subtraction
Addition is similar to decimal arithmetic
For subtraction, simply add the negative number – hence, subtracting A - B involves negating B's bits, adding 1, and then adding A
149
Overflows
For an unsigned number, overflow happens when the last carry (1) cannot be accommodated
For a signed number, overflow happens when the sign of the result is inconsistent with the signs of the operands:
 when the sum of two positive numbers is a negative result
 when the sum of two negative numbers is a positive result
The sum of a positive and a negative number will never overflow
MIPS allows addu and subu instructions that work with unsigned integers and never flag an overflow – to detect the overflow, other instructions will have to be executed
150
Multiplication Example
Multiplicand    1000ten
Multiplier    x 1001ten
                1000
               0000
              0000
             1000
Product      1001000ten
In every step
 the multiplicand is shifted
 the next bit of the multiplier is examined (also a shifting step)
 if this bit is 1, the shifted multiplicand is added to the product
151
HW Algorithm 1
In every step
 the multiplicand is shifted
 the next bit of the multiplier is examined (also a shifting step)
 if this bit is 1, the shifted multiplicand is added to the product
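A C sketch of the same shift-and-add idea, with the multiplicand shifted left while the multiplier is examined one bit at a time; a software illustration of the loop, not a model of the actual register layout:

#include <stdint.h>
#include <stdio.h>

/* Multiply two 32-bit unsigned numbers into a 64-bit product
   using shift-and-add. */
static uint64_t shift_add_multiply(uint32_t multiplicand, uint32_t multiplier) {
    uint64_t product = 0;
    uint64_t shifted = multiplicand;     /* shifted left each iteration */
    for (int i = 0; i < 32; i++) {
        if (multiplier & 1)              /* examine the next multiplier bit */
            product += shifted;          /* add the shifted multiplicand */
        shifted <<= 1;
        multiplier >>= 1;                /* also a shifting step */
    }
    return product;
}

int main(void) {
    printf("%llu\n", (unsigned long long)shift_add_multiply(8, 9));  /* prints 72 */
    return 0;
}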
152
HW Algorithm 2 32-bit ALU and multiplicand is untouched
the sum keeps shifting right at every step, number of bits in product + multiplier = 64, hence, they share a single 64-bit register
153
Notes The previous algorithm also works for signed numbers
(negative numbers in 2’s complement form) We can also convert negative numbers to positive, multiply the magnitudes, and convert to negative if signs disagree The product of two 32-bit numbers can be a 64-bit number -- hence, in MIPS, the product is saved in two 32-bit registers
154
MIPS Instructions mult $s2, $s3 computes the product and stores
it in two “internal” registers that can be referred to as hi and lo
mfhi $s0   moves the value in hi into $s0
mflo $s1   moves the value in lo into $s1
Similarly for multu
155
Fast Algorithm The previous algorithm requires a clock to ensure that
the earlier addition has completed before shifting This algorithm can quickly set up most inputs – it then has to wait for the result of each add to propagate down – faster because no clock is involved -- Note: high transistor cost
156
Division
                  1001ten      Quotient
Divisor 1000ten | 1001010ten   Dividend
                  -1000
                      10ten    Remainder
At every step,
 shift the divisor right and compare it with the current dividend
 if the divisor is larger, shift 0 as the next bit of the quotient
 if the divisor is smaller, subtract to get a new dividend and shift 1 as the next bit of the quotient
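A C sketch of this shift-and-compare division loop for unsigned 32-bit operands; the divisor starts out shifted into the upper half and one quotient bit is produced per step (a simple illustration, not the exact hardware register organization):

#include <stdint.h>
#include <stdio.h>

/* Unsigned division: quotient written to *q, remainder returned.
   Assumes divisor != 0. */
static uint32_t shift_compare_divide(uint32_t dividend, uint32_t divisor, uint32_t *q) {
    uint64_t rem = dividend;
    uint64_t div = (uint64_t)divisor << 32;   /* divisor starts in the upper half */
    *q = 0;
    for (int i = 0; i < 33; i++) {
        if (rem >= div) {                     /* divisor fits: subtract, shift 1 into Q */
            rem -= div;
            *q = (*q << 1) | 1;
        } else {                              /* divisor too large: shift 0 into Q */
            *q = (*q << 1);
        }
        div >>= 1;                            /* shift the divisor right */
    }
    return (uint32_t)rem;
}

int main(void) {
    uint32_t q;
    uint32_t r = shift_compare_divide(74, 8, &q);  /* 1001010two / 1000two */
    printf("quotient=%u remainder=%u\n", q, r);    /* quotient=9 remainder=2 */
    return 0;
}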
157
Division
                  1001ten      Quotient
Divisor 1000ten | 1001010ten   Dividend
Quo: the quotient is built up one bit at a time
At every step,
 shift the divisor right and compare it with the current dividend
 if the divisor is larger, shift 0 as the next bit of the quotient
 if the divisor is smaller, subtract to get a new dividend and shift 1 as the next bit of the quotient
158
Divide Example
Divide 7ten (0000 0111two) by 2ten (0010two)
Iter    Step             Quot    Divisor    Remainder
        Initial values
1
2
3
4
5
159
Divide Example
Divide 7ten (0000 0111two) by 2ten (0010two)
Iter  Step                                                          Quot   Divisor     Remainder
0     Initial values                                                0000   0010 0000   0000 0111
1     Rem = Rem - Div; Rem < 0: +Div, shift 0 into Q; shift Div right   0000   0001 0000   0000 0111
2     Same steps as 1                                               0000   0000 1000   0000 0111
3     Same steps as 1                                               0000   0000 0100   0000 0111
4     Rem = Rem - Div; Rem >= 0: shift 1 into Q; shift Div right        0001   0000 0010   0000 0011
5     Same steps as 4                                               0011   0000 0001   0000 0001
160
Hardware for Division A comparison requires a subtract; the sign of the result is examined; if the result is negative, the divisor must be added back
161
Efficient Division
162
Divisions involving Negatives
Simplest solution: convert to positive and adjust sign later
Note that multiple solutions exist for the equation: Dividend = Quotient x Divisor + Remainder
 +7 div +2    Quo =     Rem =
 -7 div +2    Quo =     Rem =
 +7 div -2    Quo =     Rem =
 -7 div -2    Quo =     Rem =
163
Divisions involving Negatives
Simplest solution: convert to positive and adjust sign later
Note that multiple solutions exist for the equation: Dividend = Quotient x Divisor + Remainder
 +7 div +2    Quo = +3    Rem = +1
 -7 div +2    Quo = -3    Rem = -1
 +7 div -2    Quo = -3    Rem = +1
 -7 div -2    Quo = +3    Rem = -1
Convention: Dividend and remainder have the same sign
Quotient is negative if signs disagree
These rules fulfil the equation above
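C's integer division follows this same convention (the quotient truncates toward zero and the remainder takes the sign of the dividend, guaranteed since C99), which a few prints can confirm:

#include <stdio.h>

int main(void) {
    /* dividend = quotient * divisor + remainder holds for every line */
    printf("%d %d\n",  7 /  2,  7 %  2);   /*  3  1 */
    printf("%d %d\n", -7 /  2, -7 %  2);   /* -3 -1 */
    printf("%d %d\n",  7 / -2,  7 % -2);   /* -3  1 */
    printf("%d %d\n", -7 / -2, -7 % -2);   /*  3 -1 */
    return 0;
}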
164
Lecture 9: Floating Point
Division FP arithmetic
165
Division
                  1001ten      Quotient
Divisor 1000ten | 1001010ten   Dividend
                  -1000
                      10ten    Remainder
At every step,
 shift the divisor right and compare it with the current dividend
 if the divisor is larger, shift 0 as the next bit of the quotient
 if the divisor is smaller, subtract to get a new dividend and shift 1 as the next bit of the quotient
166
Divide Example
Divide 7ten (0000 0111two) by 2ten (0010two)
Iter  Step                                                          Quot   Divisor     Remainder
0     Initial values                                                0000   0010 0000   0000 0111
1     Rem = Rem - Div; Rem < 0: +Div, shift 0 into Q; shift Div right   0000   0001 0000   0000 0111
2     Same steps as 1                                               0000   0000 1000   0000 0111
3     Same steps as 1                                               0000   0000 0100   0000 0111
4     Rem = Rem - Div; Rem >= 0: shift 1 into Q; shift Div right        0001   0000 0010   0000 0011
5     Same steps as 4                                               0011   0000 0001   0000 0001
167
Efficient Division
168
Divisions involving Negatives
Simplest solution: convert to positive and adjust sign later
Note that multiple solutions exist for the equation: Dividend = Quotient x Divisor + Remainder
 +7 div +2    Quo =     Rem =
 -7 div +2    Quo =     Rem =
 +7 div -2    Quo =     Rem =
 -7 div -2    Quo =     Rem =
169
Divisions involving Negatives
Simplest solution: convert to positive and adjust sign later
Note that multiple solutions exist for the equation: Dividend = Quotient x Divisor + Remainder
 +7 div +2    Quo = +3    Rem = +1
 -7 div +2    Quo = -3    Rem = -1
 +7 div -2    Quo = -3    Rem = +1
 -7 div -2    Quo = +3    Rem = -1
Convention: Dividend and remainder have the same sign
Quotient is negative if signs disagree
These rules fulfil the equation above
170
Floating Point
Normalized scientific notation: single non-zero digit to the left of the decimal (binary) point – example: 3.5 x 10^9
A binary example: 1.0…1two x 2^-5 = (1 + 0 x 2^-1 + … + 1 x 2^-6) x 2^-5
A standard notation enables easy exchange of data between machines and simplifies hardware algorithms – the IEEE 754 standard defines how floating point numbers are represented
171
Sign and Magnitude Representation
Sign    Exponent    Fraction
1 bit   8 bits      23 bits
 S         E           F
More exponent bits → wider range of numbers (not necessarily more numbers – recall there are infinite real numbers)
More fraction bits → higher precision
Register value = (-1)^S x F x 2^E
Since we are only representing normalized numbers, we are guaranteed that the number is of the form 1.xxxx…
Hence, in the IEEE 754 standard, the 1 is implicit
Register value = (-1)^S x (1 + F) x 2^E
172
Sign and Magnitude Representation
Sign    Exponent    Fraction
1 bit   8 bits      23 bits
 S         E           F
Largest number that can be represented:
Smallest number that can be represented:
173
Sign and Magnitude Representation
Sign    Exponent    Fraction
1 bit   8 bits      23 bits
 S         E           F
Largest number that can be represented: 2.0 x 2^128 = 2.0 x 10^38
Smallest number that can be represented: 2.0 x 2^-128 = 2.0 x 10^-38
Overflow: when representing a number larger than the one above; Underflow: when representing a number smaller than the one above
Double precision format: occupies two 32-bit registers:
Sign    Exponent    Fraction
1 bit   11 bits     52 bits
 S         E           F
Largest: 2.0 x 10^308    Smallest: 2.0 x 10^-308
174
Details The number “0” has a special code so that the implicit 1 does not get added: the code is all 0s (it may seem that this takes up the representation for 1.0, but given how the exponent is represented, we’ll soon see that that’s not the case) The largest exponent value (with zero fraction) represents +/- infinity The largest exponent value (with non-zero fraction) represents NaN (not a number) – for the result of 0/0 or (infinity minus infinity)
175
Exponent Representation
To simplify sort, the sign was placed as the first bit
For a similar reason, the representation of the exponent is also modified: in order to use integer compares, it would be preferable to have the smallest exponent as 00…0 and the largest exponent as 11…1
This is the biased notation, where a bias is subtracted from the exponent field to yield the true exponent
IEEE 754 single-precision uses a bias of 127 (since the exponent must have values between -127 and 128)… double precision uses a bias of 1023
Final representation: (-1)^S x (1 + Fraction) x 2^(Exponent - Bias)
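A small C sketch that pulls apart the three fields of a single-precision value and applies the (-1)^S x (1 + F) x 2^(E - 127) formula; it assumes a normalized IEEE 754 single (the common case) and ignores zero, denormals, infinities, and NaN:

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    float f = -5.0f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* reinterpret the 32 bits */

    uint32_t sign     = bits >> 31;          /* 1 bit  */
    uint32_t exponent = (bits >> 23) & 0xFF; /* 8 bits, biased by 127 */
    uint32_t fraction = bits & 0x7FFFFF;     /* 23 bits */

    double value = (sign ? -1.0 : 1.0) *
                   (1.0 + fraction / (double)(1 << 23)) *
                   pow(2.0, (int)exponent - 127);
    printf("S=%u E=%u F=0x%06x value=%g\n",
           (unsigned)sign, (unsigned)exponent, (unsigned)fraction, value);
    /* prints S=1 E=129 F=0x200000 value=-5 */
    return 0;
}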
176
Examples
Final representation: (-1)^S x (1 + Fraction) x 2^(Exponent - Bias)
Represent -0.75ten in single and double-precision formats
What decimal number is represented by the following single-precision number?
1 1000 0001 0100 0000 0000 0000 0000 0000
177
Examples
Final representation: (-1)^S x (1 + Fraction) x 2^(Exponent - Bias)
Represent -0.75ten in single and double-precision formats
Single: (-1)^1 x (1 + .1000…000) x 2^(126 - 127)  →  1 0111 1110 1000…000
Double: (-1)^1 x (1 + .1000…000) x 2^(1022 - 1023)  →  1 011 1111 1110 1000…000
What decimal number is represented by the following single-precision number?
1 1000 0001 0100 0000 0000 0000 0000 0000
Answer: (-1)^1 x (1 + 0.25) x 2^(129 - 127) = -5.0
178
FP Addition Consider the following decimal example (can maintain
only 4 decimal digits and 2 exponent digits)
 9.999 x 10^1  +  1.610 x 10^-1
Convert to the larger exponent:  9.999 x 10^1  +  0.016 x 10^1
Add:        10.015 x 10^1
Normalize:  1.0015 x 10^2
Check for overflow/underflow
Round:      1.002 x 10^2
Re-normalize if necessary
179
FP Addition Consider the following decimal example (can maintain
only 4 decimal digits and 2 exponent digits)
 9.999 x 10^1  +  1.610 x 10^-1
Convert to the larger exponent:  9.999 x 10^1  +  0.016 x 10^1
Add:        10.015 x 10^1
Normalize:  1.0015 x 10^2
Check for overflow/underflow
Round:      1.002 x 10^2
Re-normalize if necessary
If we had more fraction bits, these errors would be minimized
180
FP Multiplication Similar steps: Compute exponent (careful!)
Multiply significands (set the binary point correctly) Normalize Round (potentially re-normalize) Assign sign
181
MIPS Instructions The usual add.s, add.d, sub, mul, div
Comparison instructions: c.eq.s, c.neq.s, c.lt.s…. These comparisons set an internal bit in hardware that is then inspected by branch instructions: bc1t, bc1f Separate register file $f0 - $f31 : a double-precision value is stored in (say) $f4-$f5 and is referred to by $f4 Load/store instructions (lwc1, swc1) must still use integer registers for address computation
182
Code Example float f2c (float fahr) {
return ((5.0/9.0) * (fahr - 32.0)); }
(argument fahr is stored in $f12)
lwc1  $f16, const5($gp)
lwc1  $f18, const9($gp)
div.s $f16, $f16, $f18
lwc1  $f18, const32($gp)
sub.s $f18, $f12, $f18
mul.s $f0, $f16, $f18
jr    $ra
183
Lecture 10: FP, Performance Metrics
Chapter 4, Lecture 10: FP, Performance Metrics
FP arithmetic, evaluating a system
184
Examples
Final representation: (-1)^S x (1 + Fraction) x 2^(Exponent - Bias)
Represent -0.75ten in single and double-precision formats
What decimal number is represented by the following single-precision number?
1 1000 0001 0100 0000 0000 0000 0000 0000
185
Examples
Final representation: (-1)^S x (1 + Fraction) x 2^(Exponent - Bias)
Represent -0.75ten in single and double-precision formats
Single: (-1)^1 x (1 + .1000…000) x 2^(126 - 127)  →  1 0111 1110 1000…000
Double: (-1)^1 x (1 + .1000…000) x 2^(1022 - 1023)  →  1 011 1111 1110 1000…000
What decimal number is represented by the following single-precision number?
1 1000 0001 0100 0000 0000 0000 0000 0000
Answer: (-1)^1 x (1 + 0.25) x 2^(129 - 127) = -5.0
186
FP Addition Consider the following decimal example (can maintain
only 4 decimal digits and 2 exponent digits) x x 10-1 Convert to the larger exponent: x x 101 Add x 101 Normalize x 102 Check for overflow/underflow Round x 102 Re-normalize
187
FP Addition Consider the following decimal example (can maintain
only 4 decimal digits and 2 exponent digits) x x 10-1 Convert to the larger exponent: x x 101 Add x 101 Normalize x 102 Check for overflow/underflow Round x 102 Re-normalize If we had more fraction bits, these errors would be minimized
188
FP Multiplication Similar steps: Compute exponent (careful!)
Multiply significands (set the binary point correctly) Normalize Round (potentially re-normalize) Assign sign
189
MIPS Instructions The usual add.s, add.d, sub, mul, div
Comparison instructions: c.eq.s, c.neq.s, c.lt.s…. These comparisons set an internal bit in hardware that is then inspected by branch instructions: bc1t, bc1f Separate register file $f0 - $f31 : a double-precision value is stored in (say) $f4-$f5 and is referred to by $f4 Load/store instructions (lwc1, swc1) must still use integer registers for address computation
190
Code Example float f2c (float fahr) {
return ((5.0/9.0) * (fahr – 32.0)); } (argument fahr is stored in $f12) lwc1 $f16, const5($gp) lwc1 $f18, const9($gp) div.s $f16, $f16, $f18 lwc1 $f18, const32($gp) sub.s $f18, $f12, $f18 mul.s $f0, $f16, $f18 jr $ra
191
Performance Metrics Possible measures:
response time – time elapsed between start and end of a program throughput – amount of work done in a fixed time The two measures are usually linked A faster processor will improve both More processors will likely only improve throughput What influences performance?
192
Execution Time Consider a system X executing a fixed workload W
PerformanceX = 1 / Execution timeX Execution time = response time = wall clock time - Note that this includes time to execute the workload as well as time spent by the operating system co-ordinating various events The UNIX “time” command breaks up the wall clock time as user and system time
193
Speedup and Improvement
System X executes a program in 10 seconds, system Y executes the same program in 15 seconds System X is 1.5 times faster than system Y The speedup of system X over system Y is 1.5 (the ratio of execution times, 15/10) The performance improvement of X over Y is 15/10 - 1 = 0.5 = 50% The execution time reduction for the program, compared to Y, is (15-10) / 15 = 33% The execution time increase, compared to X, is (15-10) / 10 = 50%
194
Performance Equation - I
CPU execution time = CPU clock cycles x Clock cycle time Clock cycle time = 1 / Clock speed If a processor has a frequency of 3 GHz, the clock ticks 3 billion times in a second – as we’ll soon see, with each clock tick, one or more (or fewer) instructions may complete If a program runs for 10 seconds on a 3 GHz processor, how many clock cycles did it run for? If a program runs for 2 billion clock cycles on a 1.5 GHz processor, what is the execution time in seconds?
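For reference, both questions follow directly from cycles = execution time x clock rate: a program that runs for 10 seconds on a 3 GHz processor runs for 10 x 3 x 10^9 = 30 billion clock cycles, and a program that runs for 2 billion cycles on a 1.5 GHz processor takes (2 x 10^9) / (1.5 x 10^9) ≈ 1.33 seconds.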
195
Performance Equation - II
CPU clock cycles = number of instrs x avg clock cycles per instruction (CPI) Substituting in previous equation, Execution time = clock cycle time x number of instrs x avg CPI If a 2 GHz processor graduates an instruction every third cycle, how many instructions are there in a program that runs for 10 seconds?
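For reference: graduating an instruction every third cycle means an average CPI of 3; in 10 seconds a 2 GHz clock ticks 20 billion times, so the program contains roughly (20 x 10^9) / 3 ≈ 6.7 billion instructions.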
196
Factors Influencing Performance
Execution time = clock cycle time x number of instrs x avg CPI Clock cycle time: manufacturing process (how fast is each transistor), how much work gets done in each pipeline stage (more on this later) Number of instrs: the quality of the compiler and the instruction set architecture CPI: the nature of each instruction and the quality of the architecture implementation
197
Example Execution time = clock cycle time x number of instrs x avg CPI
Which of the following two systems is better? A program is converted into 4 billion MIPS instructions by a compiler ; the MIPS processor is implemented such that each instruction completes in an average of 1.5 cycles and the clock speed is 1 GHz The same program is converted into 2 billion x86 instructions; the x86 processor is implemented such that each instruction completes in an average of 6 cycles and the clock speed is 1.5 GHz
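One way to work the numbers: the MIPS system takes 4 x 10^9 instrs x 1.5 CPI x 1 ns/cycle = 6 seconds, while the x86 system takes 2 x 10^9 instrs x 6 CPI x (1 / 1.5 GHz) ≈ 8 seconds – so, for this program, the MIPS system is faster despite executing twice as many instructions.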
198
Benchmark Suites Measuring performance components is difficult for most users: average CPI requires simulation/hardware counters, instruction count requires profiling tools/hardware counters, OS interference is hard to quantify, etc. Each vendor announces a SPEC rating for their system – a measure of execution time for a fixed collection of programs; it is a function of a specific CPU, memory system, I/O system, operating system, and compiler, and it enables easy comparison of different systems The key is coming up with a collection of relevant programs
199
SPEC CPU SPEC: System Performance Evaluation Corporation, an industry
consortium that creates a collection of relevant programs The 2006 version includes 12 integer and 17 floating-point applications The SPEC rating specifies how much faster a system is, compared to a baseline machine – a system with SPEC rating 600 is 1.5 times faster than a system with SPEC rating 400 Note that this rating incorporates the behavior of all 29 programs – this may not necessarily predict performance for your favorite program!
200
Deriving a Single Performance Number
How is the performance of 29 different apps compressed into a single performance number? SPEC uses geometric mean (GM) – the execution time of each program is multiplied and the Nth root is derived Another popular metric is arithmetic mean (AM) – the average of each program’s execution time Weighted arithmetic mean – the execution times of some programs are weighted to balance priorities
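A small C sketch (with made-up execution times, just to contrast the two means) is shown below; note that SPEC actually applies the geometric mean to ratios against a reference machine rather than to raw times:
#include <stdio.h>
#include <math.h>   /* compile with -lm */

int main(void) {
    double times[] = {2.0, 8.0, 4.0};               /* hypothetical execution times in seconds */
    int n = sizeof times / sizeof times[0];
    double sum = 0.0, prod = 1.0;
    for (int i = 0; i < n; i++) { sum += times[i]; prod *= times[i]; }
    printf("arithmetic mean = %.3f s\n", sum / n);           /* (2+8+4)/3 = 4.667 */
    printf("geometric mean  = %.3f s\n", pow(prod, 1.0 / n)); /* cube root of 64  = 4.000 */
    return 0;
}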
201
Amdahl’s Law Architecture design is very bottleneck-driven – make the
common case fast, do not waste resources on a component that has little impact on overall performance/power Amdahl’s Law: the performance improvement through an enhancement is limited by the fraction of time the enhancement comes into play Example: a web server spends 40% of time in the CPU and 60% of time doing I/O – a new processor that is ten times faster results in a 36% reduction in execution time (speedup of 1.56) – Amdahl’s Law states that the maximum execution time reduction is 40% (max speedup of 1.66)
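In equation form (f = fraction of time the enhancement applies, s = speedup of that portion): overall speedup = 1 / ((1 - f) + f/s). For the web server, f = 0.4 and s = 10, so speedup = 1 / (0.6 + 0.04) ≈ 1.56; letting s grow without bound gives the ceiling 1 / 0.6 ≈ 1.67, i.e. at most a 40% reduction in execution time.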
202
Lecture 11: Digital Design
Evaluating a system Intro to boolean functions
208
Digital Design Basics Two voltage levels – high and low (1 and 0, true and false) Hence, the use of binary arithmetic/logic in all computers A transistor is a 3-terminal device that acts as a switch: the voltage on one terminal (the gate) determines whether the path between the other two terminals is conducting or non-conducting (figure: the transistor in each of the two states)
209
Logic Blocks A logic block has a number of binary inputs and produces
a number of binary outputs – the simplest logic block is composed of a few transistors A logic block is termed combinational if the output is only a function of the inputs A logic block is termed sequential if the block has some internal memory (state) that also influences the output A basic logic block is termed a gate (AND, OR, NOT, etc.) We will only deal with combinational circuits today
210
Truth Table A truth table defines the outputs of a logic block for each set of inputs Consider a block with 3 inputs A, B, C and an output E that is true only if exactly 2 inputs are true A B C E
211
Truth Table A truth table defines the outputs of a logic block for each set of inputs Consider a block with 3 inputs A, B, C and an output E that is true only if exactly 2 inputs are true
A B C | E
0 0 0 | 0
0 0 1 | 0
0 1 0 | 0
0 1 1 | 1
1 0 0 | 0
1 0 1 | 1
1 1 0 | 1
1 1 1 | 0
Can be compressed by only representing cases that have an output of 1
212
Boolean Algebra Equations involving two values and three primary operators: OR : symbol + , X = A + B X is true if at least one of A or B is true AND : symbol . , X = A . B X is true if both A and B are true NOT : symbol ' (complement) , X = A' X is the inverted value of A
213
Boolean Algebra Rules Identity law : A + 0 = A ; A . 1 = A
Zero and One laws : A + 1 = 1 ; A . 0 = 0 Inverse laws : A . A' = 0 ; A + A' = 1 Commutative laws : A + B = B + A ; A . B = B . A Associative laws : A + (B + C) = (A + B) + C A . (B . C) = (A . B) . C Distributive laws : A . (B + C) = (A . B) + (A . C) A + (B . C) = (A + B) . (A + C)
214
DeMorgan’s Laws (A + B)' = A' . B' (A . B)' = A' + B'
Confirm that these are indeed true
215
Pictorial Representations
AND OR NOT What logic function is this?
216
Boolean Equation Consider the logic block that has an output E that is true only if exactly two of the three inputs A, B, C are true
217
Boolean Equation Consider the logic block that has an output E that is true only if exactly two of the three inputs A, B, C are true Multiple correct equations: Two must be true, but all three cannot be true: E = ((A . B) + (B . C) + (A . C)) . (A . B . C)' Identify the three cases where it is true: E = (A . B . C') + (A . C . B') + (C . B . A')
218
Sum of Products Can represent any logic block with the AND, OR, NOT operators Draw the truth table For each true output, represent the corresponding inputs as a product The final equation is a sum of these products For the earlier example (3 inputs A, B, C; E true when exactly two are true): E = (A . B . C') + (A . C . B') + (C . B . A') Can also use “product of sums” Any equation can be implemented with an array of ANDs, followed by an array of ORs
219
NAND and NOR NAND : NOT of AND : A nand B = (A . B)'
NOR : NOT of OR : A nor B = (A + B)' NAND and NOR are universal gates, i.e., they can be used to construct any complex logical function
220
Common Logic Blocks – Decoder
Takes in N inputs and activates one of 2^N outputs (figure: a 3-to-8 decoder with inputs I0-I2 and outputs O0-O7)
221
Common Logic Blocks – Multiplexor
Multiplexor or selector: one of N inputs is reflected on the output depending on the value of the log2N selector bits 2-input mux
222
Lecture 12: Hardware for Arithmetic
Designing an ALU Carry-lookahead adder
225
Adder Algorithm Example: 1 0 0 1 + 0 1 0 1 gives Sum 1 1 1 0 and Carry 0 0 0 1
Truth Table for the above 1-bit operation:
A B Cin | Sum Cout
0 0 0 | 0 0
0 0 1 | 1 0
0 1 0 | 1 0
0 1 1 | 0 1
1 0 0 | 1 0
1 0 1 | 0 1
1 1 0 | 0 1
1 1 1 | 1 1
226
Adder Algorithm Example: 1 0 0 1 + 0 1 0 1 gives Sum 1 1 1 0 and Carry 0 0 0 1
Equations (from the truth table above): Sum = Cin . A' . B' + B . Cin' . A' + A . Cin' . B' + A . B . Cin Cout = A . B . Cin' + A . Cin . B' + B . Cin . A' + A . B . Cin = A . B + A . Cin + B . Cin
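A minimal C sketch (not from the slides) that applies the same Sum and Cout equations bit by bit – in other words, a ripple-carry adder in software:
#include <stdio.h>
#include <stdint.h>

/* one full-adder bit: returns the sum bit, writes the carry-out */
static unsigned full_add(unsigned a, unsigned b, unsigned cin, unsigned *cout) {
    *cout = (a & b) | (a & cin) | (b & cin);   /* Cout = A.B + A.Cin + B.Cin */
    return a ^ b ^ cin;                        /* Sum  = A xor B xor Cin     */
}

int main(void) {
    uint32_t a = 9, b = 5, result = 0;         /* 1001 + 0101, the example above */
    unsigned carry = 0;
    for (int i = 0; i < 32; i++) {             /* ripple the carry from bit 0 upward */
        unsigned s = full_add((a >> i) & 1, (b >> i) & 1, carry, &carry);
        result |= (uint32_t)s << i;
    }
    printf("%u + %u = %u (carry out = %u)\n", a, b, result, carry);
    return 0;
}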
227
Carry Out Logic Equations: Sum = Cin . A' . B' + B . Cin' . A' +
A . Cin' . B' + A . B . Cin Cout = A . B . Cin' + A . Cin . B' + B . Cin . A' + A . B . Cin = A . B + A . Cin + B . Cin
228
1-Bit ALU with Add, Or, And Multiplexor selects between Add, Or, And operations
229
32-bit Ripple Carry Adder
1-bit ALUs are connected “in series” with the carry-out of 1 box going into the carry-in of the next box
230
Incorporating Subtraction
Must invert bits of B and add a 1 Include an inverter CarryIn for the first bit is 1 The CarryIn signal (for the first bit) can be the same as the Binvert signal
231
Incorporating NOR
232
Incorporating slt Perform a – b and check the sign
New signal (Less) that is zero for ALU boxes 1-31 The 31st box has a unit to detect overflow and sign – the sign bit serves as the Less signal for the 0th box
233
Incorporating beq Perform a – b and confirm that the
result is all zeros
234
Control Lines What are the values of the control lines
and what operations do they correspond to?
235
Control Lines What are the values of the control lines
and what operations do they correspond to?
Ainvert Bnegate Op | Operation
0 0 00 | AND
0 0 01 | OR
0 0 10 | Add
0 1 10 | Sub
0 1 11 | SLT
1 1 00 | NOR
236
Speed of Ripple Carry The carry propagates thru every 1-bit box: each 1-bit box sequentially implements AND and OR – total delay is the time to go through 64 gates! We’ve already seen that any logic equation can be expressed as the sum of products – so it should be possible to compute the result by going through only 2 gates! Caveat: need many parallel gates and each gate may have a very large number of inputs – it is difficult to efficiently build such large gates, so we’ll find a compromise: moderate number of gates moderate number of inputs to each gate moderate number of sequential gates traversed
237
Computing CarryOut CarryIn1 = b0.CarryIn0 + a0.CarryIn0 + a0.b0
CarryIn2 = b1.CarryIn1 + a1.CarryIn1 + a1.b1 = b1.b0.c0 + b1.a0.c0 + b1.a0.b0 + a1.b0.c0 + a1.a0.c0 + a1.a0.b0 + a1.b1 … CarryIn32 = a really large sum of really large products Potentially fast implementation as the result is computed by going thru just 2 levels of logic – unfortunately, each gate is enormous and slow
238
Generate and Propagate
Equation re-phrased: Ci+1 = ai.bi + ai.Ci + bi.Ci = (ai.bi) + (ai + bi).Ci Stated verbally, the current pair of bits will generate a carry if they are both 1 and the current pair of bits will propagate a carry if either is 1 Generate signal = ai.bi Propagate signal = ai + bi Therefore, Ci+1 = Gi + Pi . Ci
239
Generate and Propagate
c1 = g0 + p0.c0 c2 = g1 + p1.c1 = g1 + p1.g0 + p1.p0.c0 c3 = g2 + p2.g1 + p2.p1.g0 + p2.p1.p0.c0 c4 = g3 + p3.g2 + p3.p2.g1 + p3.p2.p1.g0 + p3.p2.p1.p0.c0 Either, a carry was just generated, or a carry was generated in the last step and was propagated, or a carry was generated two steps back and was propagated by both the next two stages, or a carry was generated N steps back and was propagated by every single one of the N next stages
240
Divide and Conquer The equations on the previous slide are still difficult to implement as logic functions – for the 32nd bit, we must AND every single propagate bit to determine what becomes of c0 (among other things) Hence, the bits are broken into groups (of 4) and each group computes its group-generate and group-propagate For example, to add 32 numbers, you can partition the task as a tree .
241
P and G for 4-bit Blocks Compute P0 and G0 (super-propagate and super-generate) for the first group of 4 bits (and similarly for other groups of 4 bits) P0 = p0.p1.p2.p3 G0 = g3 + g2.p3 + g1.p2.p3 + g0.p1.p2.p3 Carry out of the first group of 4 bits is C1 = G0 + P0.c0 C2 = G1 + P1.G0 + P1.P0.c0 … By having a tree of sub-computations, each AND, OR gate has few inputs and logic signals have to travel through a modest set of gates (equal to the height of the tree)
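To make the signal definitions concrete, here is a small C sketch (operand values are made up) that computes the per-bit g/p signals and the group P, G, and block carry-out for one 4-bit group:
#include <stdio.h>

int main(void) {
    unsigned a = 0xB, b = 0x6, c0 = 0;                 /* hypothetical 4-bit operands: 1011 + 0110 */
    unsigned g[4], p[4];
    for (int i = 0; i < 4; i++) {
        unsigned ai = (a >> i) & 1, bi = (b >> i) & 1;
        g[i] = ai & bi;                                /* generate  = ai.bi      */
        p[i] = ai | bi;                                /* propagate = ai + bi    */
    }
    /* group signals for the 4-bit block */
    unsigned P = p[0] & p[1] & p[2] & p[3];
    unsigned G = g[3] | (g[2] & p[3]) | (g[1] & p[2] & p[3]) | (g[0] & p[1] & p[2] & p[3]);
    unsigned C4 = G | (P & c0);                        /* carry out of the block */
    printf("P = %u, G = %u, C4 = %u\n", P, G, C4);     /* 11 + 6 = 17, so C4 = 1 here */
    return 0;
}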
242
Example Add A and B g p P G C4 = 1
243
Carry Look-Ahead Adder
16-bit Ripple-carry takes 32 steps This design takes how many steps?
244
Lecture 13: Sequential Circuits
Carry-lookahead adder Clocks and sequential circuits Finite state machines
253
Clocks A microprocessor is composed of many different circuits
that are operating simultaneously – if each circuit X takes in inputs at time TIX, takes time TEX to execute the logic, and produces outputs at time TOX, imagine the complications in co-ordinating the tasks of every circuit A major school of thought (used in most processors built today): all circuits on the chip share a clock signal (a square wave) that tells every circuit when to accept inputs, how much time they have to execute the logic, and when they must produce outputs
254
Clock Terminology Rising clock edge Cycle time Falling clock edge
Clock speed = 4 GHz ; cycle time = 1 / (4 GHz) = 250 ps
255
Sequential Circuits Until now, circuits were combinational – when inputs change, the outputs change after a while (time = logic delay thru circuit) (figure: inputs, combinational circuit, outputs) We want the clock to act like a start and stop signal – a “latch” is a storage device that stores its inputs at a rising clock edge and this storage will not change until the next rising clock edge (figure: latch, combinational circuit, latch, with the clock driving the latches)
256
Sequential Circuits Sequential circuit: consists
of a combinational circuit and a storage element At the start of the clock cycle, the rising edge causes the “state” storage to store some input values This state will not change for an entire cycle (until the next rising edge) The combinational circuit has some time to accept the value of “state” and “inputs” and produce “outputs” Some of the outputs (for example, the value of the next “state”) may feed back (but through the latch, so they are only seen in the next cycle) (figure: inputs and the current state feed the combinational circuit, which produces the outputs and the next state; the clock drives the state storage)
257
Designing a Latch An S-R latch: set-reset latch
When Set is high, a 1 is stored When Reset is high, a 0 is stored When both are low, the previous state is preserved (hence, known as a storage or memory element) When both are high, the output is unstable – this set of inputs is therefore not allowed Verify the above behavior!
258
D Latch Incorporates a clock
The value of the input D signal (data) is stored only when the clock is high – the previous state is preserved when the clock is low
259
D Flip Flop Terminology:
Latch: outputs can change any time the clock is high (asserted) Flip flop: outputs can change only on a clock edge Two D latches in series – ensures that a value is stored only on the falling edge of the clock
260
Sequential Circuits We want the clock to act like a start and stop signal – a “latch” is a storage device that stores its inputs at a rising clock edge and this storage will not change until the next rising clock edge (figure: latch, combinational circuit, latch, with the clock driving the latches)
261
Finite State Machine A sequential circuit is described by a variation of a truth table – a finite state diagram (hence, the circuit is also called a finite state machine) Note that state is updated only on a clock edge (figure: inputs and the current state feed a next-state function and an output function; the clock updates the current state)
262
State Diagrams Each state is shown with a circle, labeled with the state value – the contents of the circle are the outputs An arc represents a transition to a different state, with the inputs indicated on the label D = 0 D = 1 This is a state diagram for ___? D = 1 1 1 D = 0
263
3-Bit Counter Consider a circuit that stores a number and increments the value on every clock edge – on reaching the largest value, it starts again from 0 Draw the state diagram: How many states? How many inputs?
264
3-Bit Counter Consider a circuit that stores a number and increments the value on every clock edge – on reaching the largest value, it starts again from 0 Draw the state diagram: How many states? How many inputs? 8 states and no inputs other than the clock: 000 → 001 → 010 → 011 → 100 → 101 → 110 → 111 → 000
265
Traffic Light Controller
Problem description: A traffic light with only green and red; either the North-South road has green or the East-West road has green (both can’t be red); there are detectors on the roads to indicate if a car is on the road; the lights are updated every 30 seconds; a light need change only if a car is waiting on the other road State Transition Table: How many states? How many inputs? How many outputs?
266
State Transition Table
Problem description: A traffic light with only green and red; either the North-South road has green or the East-West road has green (both can’t be red); there are detectors on the roads to indicate if a car is on the road; the lights are updated every 30 seconds; a light must change only if a car is waiting on the other road State Transition Table:
CurrState InputEW InputNS | NextState=Output
N 0 0 | N
N 0 1 | N
N 1 0 | E
N 1 1 | E
E 0 0 | E
E 0 1 | N
E 1 0 | E
E 1 1 | N
267
State Diagram State Transition Table:
CurrState InputEW InputNS | NextState=Output
N 0 0 | N
N 0 1 | N
N 1 0 | E
N 1 1 | E
E 0 0 | E
E 0 1 | N
E 1 0 | E
E 1 1 | N
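The same next-state function written as a C sketch (state and input names follow the table; the 30-second update timing is left outside the snippet):
#include <stdio.h>

typedef enum { NS_GREEN, EW_GREEN } State;   /* N and E in the table */

/* next-state function: change only if a car waits on the other road */
static State next_state(State cur, int carEW, int carNS) {
    switch (cur) {
    case NS_GREEN: return carEW ? EW_GREEN : NS_GREEN;
    case EW_GREEN: return carNS ? NS_GREEN : EW_GREEN;
    }
    return cur;
}

int main(void) {
    State s = NS_GREEN;
    int carEW[] = {0, 1, 0, 0, 1}, carNS[] = {0, 0, 1, 0, 0};   /* made-up detector readings */
    for (int t = 0; t < 5; t++) {
        s = next_state(s, carEW[t], carNS[t]);                  /* one 30-second update */
        printf("tick %d: %s green\n", t, s == NS_GREEN ? "N-S" : "E-W");
    }
    return 0;
}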
268
Lecture 14: FSM and Basic CPU Design
Chapter : 5 Lecture 14: FSM and Basic CPU Design Finite state machines Single-cycle CPU
277
Basic MIPS Architecture
Now that we understand clocks and storage of states, we’ll design a simple CPU that executes: basic math (add, sub, and, or, slt) memory access (lw and sw) branch and jump instructions (beq and j)
278
Implementation Overview
We need memory to store instructions to store data for now, let’s make them separate units We need registers, ALU, and a whole lot of control logic CPU operations common to all instructions: use the program counter (PC) to pull instruction out of instruction memory read register values
279
View from 30,000 Feet Note: we haven’t bothered showing multiplexors What is the role of the Add units? Explain the inputs to the data memory unit Explain the inputs to the ALU Explain the inputs to the register unit
280
Clocking Methodology Which of the above units need a clock?
What is being saved (latched) on the rising edge of the clock? Keep in mind that the latched value remains there for an entire cycle
281
Implementing R-type Instructions
Instructions of the form add $t1, $t2, $t3 Explain the role of each signal
282
Implementing Loads/Stores
Instructions of the form lw $t1, 8($t2) and sw $t1, 8($t2) Where does this input come from?
283
Implementing J-type Instructions
Instructions of the form beq $t1, $t2, offset
284
View from 10,000 Feet
285
View from 5,000 Feet
286
Single Vs. Multi-Cycle Machine
In this implementation, every instruction requires one cycle to complete – cycle time = time taken for the slowest instruction If the execution was broken into multiple (faster) cycles, the shorter instructions can finish sooner Single-cycle (cycle time = 20 ns): Load 1 cycle, Add 1 cycle, Beq 1 cycle Multi-cycle (cycle time = 5 ns): Load 4 cycles, Add 3 cycles, Beq 2 cycles Time for a load, add, and beq = 3 x 20 ns = 60 ns vs. (4 + 3 + 2) x 5 ns = 45 ns
287
Lecture 16: Basic CPU Design
Single-cycle CPU Multi-cycle CPU
298
Multi-Cycle Processor
Single memory unit shared by instructions and memory Single ALU also used for PC updates Registers (latches) to store the result of every block
299
Cycle 1 The PC is used to select the appropriate instruction out
of the memory unit The instruction is latched into the instruction register at the end of the clock cycle The ALU performs PC+4 and stores it in the PC at the end of the clock cycle (note that ALU is free this cycle) The control circuits must now be “cycle-aware” – the new PC need not look up the instr-memory until we’re done executing the current instruction
300
Cycle 2 The instruction specifies the required register values –
these are read from the register file and stored in latches A and B (this happens even if the operands are not required) The last 16 bits are also used to compute PC+4+offset (in case this instruction turns out to be a branch) – this is latched into ALUOut Note that we haven’t yet figured out the instruction type, so the above operations are “speculative”
301
Cycle 3 The operations depend on the instruction type
Memory access: the address is computed by adding the offset to the value read from the register file, result is latched into ALUOut ALU: ALU operations are performed on the values read from the register file and the result is latched into ALUOut Branch: the ALU performs the operations for “beq” and if the branch happens, the branch target (currently in ALUOut) is latched into the PC at the end of the cycle Note that the branch operation has completed by the end of cycle 3, the other two are still in progress
302
Cycle 4 Memory access: the address in ALUOut is used to pick
out a word from memory – this is latched into the memory data register ALU: the result latched into ALUOut is fed as input to the register file, the instruction stored in the instruction-latch specifies where the result is written into At the end of this cycle, the ALU operation and memory writes are complete
303
Cycle 5 Memory read: the value read from memory (and latched
in the memory data register) is now written into the register file Summary: Branches and jumps: 3 cycles ALU, stores: 4 cycles Memory access: 5 cycles ALU is slower since it requires a register file write Store is slower since it requires a data memory write Load is slower since it requires a data memory read and a register file write
304
Average CPI Now we can compute average CPI for a program: if the
given program is composed of loads (25%), stores (10%), branches (13%), and ALU ops (52%), the average CPI is 0.25 x 5 + 0.10 x 4 + 0.13 x 3 + 0.52 x 4 = 4.12 You can break this CPU design into shorter cycles, for example, a load would then take 10 cycles, stores 8, ALU 8, branch 6 – the average CPI would double, but so would the clock speed, so the net performance would remain roughly the same Later, we’ll see that this strategy does help in most other cases.
305
Control Logic Note that the control signals for every unit are determined by two factors: the instruction type the cycle number for this instruction The control is therefore implemented as a finite state machine – every cycle, the FSM transitions to a new state with a certain set of outputs (the control signals) and this is a function of the inputs (the instr type)
306
Lecture 17: Basic Pipelining
Chapter : 6 Lecture 17: Basic Pipelining 5-stage pipeline Hazards and instruction scheduling
308
The Assembly Line
Unpipelined: start and finish a job before moving to the next Pipelined: break the job into smaller stages, so that several jobs occupy different stages at the same time (figure: jobs A, B, C over time under each scheme)
309
Performance Improvements?
Does it take longer to finish each individual job? Does it take shorter to finish a series of jobs? What assumptions were made while answering these questions? Is a 10-stage pipeline better than a 5-stage pipeline?
310
Quantitative Effects As a result of pipelining:
Time in ns per instruction goes up Each instruction takes more cycles to execute But… average CPI remains roughly the same Clock speed goes up Total execution time goes down, resulting in lower average time per instruction Under ideal conditions, speedup = ratio of elapsed times between successive instruction completions = number of pipeline stages = increase in clock speed
311
A 5-Stage Pipeline
312
A 5-Stage Pipeline Use the PC to access the I-cache and increment PC by 4
313
A 5-Stage Pipeline Read registers, compare registers, compute branch target; for now, assume branches take 2 cyc (there is enough work that branches can easily take more)
314
A 5-Stage Pipeline ALU computation, effective address computation for load/store
315
A 5-Stage Pipeline Memory access to/from data cache, stores finish in 4 cycles
316
A 5-Stage Pipeline Write result of ALU computation or load into register file
317
Conflicts/Problems I-cache and D-cache are accessed in the same cycle – it helps to implement them separately Registers are read and written in the same cycle – easy to deal with if register read/write time equals cycle time/2 (else, use bypassing) Branch target changes only at the end of the second stage -- what do you do in the meantime? Data between stages get latched into registers (overhead that increases latency per instruction)
318
Hazards Structural hazards: different instructions in different stages
(or the same stage) conflicting for the same resource Data hazards: an instruction cannot continue because it needs a value that has not yet been generated by an earlier instruction Control hazard: fetch cannot continue because it does not know the outcome of an earlier branch – special case of a data hazard – separate category because they are treated in different ways
319
Structural Hazards Example: a unified instruction and data cache
stage 4 (MEM) and stage 1 (IF) can never coincide The later instruction and all its successors are delayed until a cycle is found when the resource is free these are pipeline bubbles Structural hazards are easy to eliminate – increase the number of resources (for example, implement a separate instruction and data cache)
320
Data Hazards
321
Bypassing Some data hazard stalls can be eliminated: bypassing
322
Data Hazard Stalls
323
Data Hazard Stalls
324
Example add $1, $2, $3 lw $4, 8($1)
325
Example lw $1, 8($2) lw $4, 8($1)
326
Example lw $1, 8($2) sw $1, 8($3)
327
Control Hazards Simple techniques to handle control hazard stalls:
for every branch, introduce a stall cycle (note: every 6th instruction is a branch!) assume the branch is not taken and start fetching the next instruction – if the branch is taken, need hardware to cancel the effect of the wrong-path instruction fetch the next instruction (branch delay slot) and execute it anyway – if the instruction turns out to be on the correct path, useful work was done – if the instruction turns out to be on the wrong path, hopefully program state is not lost
328
Branch Delay Slots
329
Slowdowns from Stalls Perfect pipelining with no hazards an instruction completes every cycle (total cycles ~ num instructions) speedup = increase in clock speed = num pipeline stages With hazards and stalls, some cycles (= stall time) go by during which no instruction completes, and then the stalled instruction completes Total cycles = number of instructions + stall cycles
330
Lecture 18: Pipelining Hazards and instruction scheduling
Branch prediction Out-of-order execution
339
Pipeline without Branch Predictor
PC IF (br) Reg Read Compare Br-target PC + 4
340
Pipeline with Branch Predictor
PC IF (br) Reg Read Compare Br-target Branch Predictor
341
Bimodal Predictor A table of 16K entries of 2-bit saturating counters, indexed by 14 bits of the branch PC
(figure: branch PC, 14-bit index, counter table)
342
2-Bit Prediction For each branch, maintain a 2-bit saturating counter:
if the branch is taken: counter = min(3,counter+1) if the branch is not taken: counter = max(0,counter-1) … sound familiar? If (counter >= 2), predict taken, else predict not taken The counter attempts to capture the common case for each branch
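A C sketch of the predictor (illustrative only; the 14-bit index and the choice to drop the two low PC bits are assumptions, matching the bimodal table above):
#include <stdio.h>
#include <stdint.h>

#define ENTRIES (1 << 14)
static uint8_t counter[ENTRIES];             /* 2-bit saturating counters, start at 0 */

static int predict(uint32_t pc) { return counter[(pc >> 2) & (ENTRIES - 1)] >= 2; }

static void update(uint32_t pc, int taken) {
    uint8_t *c = &counter[(pc >> 2) & (ENTRIES - 1)];
    if (taken) { if (*c < 3) (*c)++; }       /* counter = min(3, counter+1) */
    else       { if (*c > 0) (*c)--; }       /* counter = max(0, counter-1) */
}

int main(void) {
    uint32_t pc = 0x40001c;                  /* hypothetical branch PC */
    int outcomes[] = {1, 1, 0, 1, 1, 1};     /* made-up branch outcomes */
    for (int i = 0; i < 6; i++) {
        printf("predict %-9s actual %s\n", predict(pc) ? "taken," : "not-taken,",
               outcomes[i] ? "taken" : "not-taken");
        update(pc, outcomes[i]);
    }
    return 0;
}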
344
Multicycle Instructions
Multiple parallel pipelines – each pipeline can have a different number of stages Instructions can now complete out of order – must make sure that writes to a register happen in the correct order
345
An Out-of-Order Processor Implementation
(figure) Instructions flow from the Instr Fetch Queue (after branch prediction and instr fetch) through Decode & Rename into the Issue Queue (IQ) and the Reorder Buffer (ROB, entries Instr 1-6 with tags T1-T6); the architected Register File holds R1-R32; e.g., the sequence R1=R1+R2, R2=R1+R3, BEQZ R2, R3=R1+R2, R1=R3+R2 is renamed to T1=R1+R2, T2=T1+R3, BEQZ T2, T4=T1+T2, T5=T4+T2 before issuing to the ALUs; results are written to the ROB and tags broadcast to the IQ
346
Chapter : 7 Lecture 19: Cache Basics Out-of-order execution
Cache hierarchies
349
Cache Hierarchies Data and instructions are stored on DRAM chips – DRAM is a technology that has high bit density, but relatively poor latency – an access to data in memory can take as many as 300 cycles today! Hence, some data is stored on the processor in a structure called the cache – caches employ SRAM technology, which is faster, but has lower bit density Internet browsers also cache web pages – same concept
350
Memory Hierarchy As you go further, capacity and latency increase
Registers (1 KB, 1 cycle) → L1 data or instruction cache (32 KB, 2 cycles) → L2 cache (2 MB, 15 cycles) → Memory (1 GB, 300 cycles) → Disk (80 GB, 10M cycles)
351
Locality Why do caches work?
Temporal locality: if you used some data recently, you will likely use it again Spatial locality: if you used some data recently, you will likely access its neighbors No hierarchy: average access time for data = 300 cycles 32KB 1-cycle L1 cache that has a hit rate of 95%: average access time = 0.95 x 1 + 0.05 x (301) = 16 cycles
352
Accessing the Cache Byte address 101000 Offset 8-byte words
8 words: 3 index bits Direct-mapped cache: each address maps to a unique location in the cache (figure: the index selects one of the sets in the data array; the low bits are the offset within the 8-byte word)
353
The Tag Array Byte address 101000 Tag 8-byte words Compare
Direct-mapped cache: each address maps to a unique location in the cache Tag array Data array
354
Example Access Pattern
Byte address Assume that addresses are 8 bits long How many of the following address requests are hits/misses? 4, 7, 10, 13, 16, 68, 73, 78, 83, 88, 4, 7, 10… 101000 Tag 8-byte words Compare Direct-mapped cache: each address maps to a unique address Tag array Data array
355
Increasing Line Size Byte address
A large cache line size → smaller tag array, fewer misses because of spatial locality 32-byte cache line size or block size Tag Offset Tag array Data array
356
Associativity Byte address
Set associativity → fewer conflicts; wasted power because multiple data and tags are read Tag Way-1 Way-2 Tag array Data array Compare
357
How many offset/index/tag bits if the cache has
Associativity How many offset/index/tag bits if the cache has 64 sets, each set has 64 bytes, 4 ways Byte address Tag Way-1 Way-2 Tag array Data array Compare
358
Example 32 KB 4-way set-associative data cache array with 32
byte line sizes How many sets? How many index bits, offset bits, tag bits? How large is the tag array?
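One way to work it out, assuming 32-bit addresses: 32 KB / (32 B per line x 4 ways) = 256 sets, so 8 index bits and 5 offset bits; the tag is 32 - 8 - 5 = 19 bits; the tag array holds 256 x 4 = 1024 tags, i.e. 1024 x 19 bits ≈ 2.4 KB (plus valid/dirty bits).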
359
Cache Misses On a write miss, you may either choose to bring the block
into the cache (write-allocate) or not (write-no-allocate) On a read miss, you always bring the block in (spatial and temporal locality) – but which block do you replace? no choice for a direct-mapped cache randomly pick one of the ways to replace replace the way that was least-recently used (LRU) FIFO replacement (round-robin)
360
Writes When you write into a block, do you also update the copy in L2?
write-through: every write to L1 write to L2 write-back: mark the block as dirty, when the block gets replaced from L1, write it to L2 Writeback coalesces multiple writes to an L1 block into one L2 write Writethrough simplifies coherency protocols in a multiprocessor system as the L2 always has a current copy of data
361
Types of Cache Misses Compulsory misses: happen the first time a memory word is accessed – the misses for an infinite cache Capacity misses: happen because the program touched many other words before re-touching the same word – the misses for a fully-associative cache Conflict misses: happen because two words map to the same location in the cache – the misses generated while moving from a fully-associative to a direct-mapped cache
362
Lecture 20: Cache Hierarchies, Virtual Memory
371
Virtual Memory Processes deal with virtual memory – they have the
illusion that a very large address space is available to them There is only a limited amount of physical memory that is shared by all processes – a process places part of its virtual memory in this physical memory and the rest is stored on disk (called swap space) Thanks to locality, disk access is likely to be uncommon The hardware ensures that one process cannot access the memory of a different process
372
Address Translation The virtual and physical memory are broken up into pages (8KB page size) Virtual address = virtual page number + 13-bit page offset; the virtual page number is translated to a physical page number, which together with the unchanged offset forms the physical address
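In C, the split for 8 KB pages looks like the sketch below (32-bit virtual addresses assumed; the address value is made up):
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t vaddr  = 0x00403a74;                   /* hypothetical virtual address        */
    uint32_t offset = vaddr & ((1u << 13) - 1);     /* low 13 bits: page offset (8 KB page) */
    uint32_t vpn    = vaddr >> 13;                  /* remaining bits: virtual page number  */
    printf("vpn = 0x%x, offset = 0x%x\n", vpn, offset);
    /* translation replaces vpn with the physical page number from the page table / TLB */
    return 0;
}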
373
Memory Hierarchy Properties
A virtual memory page can be placed anywhere in physical memory (fully-associative) Replacement is usually LRU (since the miss penalty is huge, we can invest some effort to minimize misses) A page table (indexed by virtual page number) is used for translating virtual to physical page number The page table is itself in memory
374
TLB Since the number of pages is very high, the page table
capacity is too large to fit on chip A translation lookaside buffer (TLB) caches the virtual to physical page number translation for recent accesses A TLB miss requires us to access the page table, which may not even be found in the cache – two expensive memory look-ups to access one word of data! A large page size can increase the coverage of the TLB and reduce the capacity of the page table, but also increases memory wastage
375
TLB and Cache Is the cache indexed with virtual or physical address?
To index with a physical address, we will have to first look up the TLB, then the cache longer access time Multiple virtual addresses can map to the same physical address – must ensure that these different virtual addresses will map to the same location in cache – else, there will be two different copies of the same physical memory word Does the tag array store virtual or physical addresses? Since multiple virtual addresses can map to the same physical address, a virtual tag comparison can flag a miss even if the correct physical memory word is present
376
Cache and TLB Pipeline Virtually Indexed; Physically Tagged Cache
(figure: the virtual index and offset from the virtual address read the tag and data arrays while the virtual page number is looked up in the TLB in parallel; the resulting physical page number is compared against the physical tag) Virtually Indexed; Physically Tagged Cache
377
Bad Events Consider the longest latency possible for a load instruction: TLB miss: must look up page table to find translation for v.page P Calculate the virtual memory address for the page table entry that has the translation for page P – let’s say, this is v.page Q TLB miss for v.page Q: will require navigation of a hierarchical page table (let’s ignore this case for now and assume we have succeeded in finding the physical memory location (R) for page Q) Access memory location R (find this either in L1, L2, or memory) We now have the translation for v.page P – put this into the TLB We now have a TLB hit and know the physical page number – this allows us to do tag comparison and check the L1 cache for a hit If there’s a miss in L1, check L2 – if that misses, check in memory At any point, if the page table entry claims that the page is on disk, flag a page fault – the OS then copies the page from disk to memory and the hardware resumes what it was doing before the page fault … phew!
378
Lecture 21: Virtual Memory, I/O Basics
I/O overview
386
Input/Output CPU Cache Bus Memory Disk Network USB DVD …
387
… I/O Hierarchy CPU Cache Disk Memory Bus Memory I/O Controller
I/O Bus Network USB DVD …
388
Intel Example P4 Processor System bus 800 MHz, 6.4 GB/sec Memory
Graphics output Memory Controller Hub (North Bridge) Main Memory 2.1 GB/sec DDR 400 3.2 GB/sec 1 Gb Ethernet 266 MB/sec 266 MB/sec Serial ATA 150 MB/s I/O Controller Hub (South Bridge) CD/DVD Disk 100 MB/s Tape 100 MB/s USB 2.0 60 MB/s
389
Bus Design The bus is a shared resource – any device can send
data on the bus (after first arbitrating for it) and all other devices can read this data off the bus The address/control signals on the bus specify the intended receiver of the message The length of the bus determines its speed (hence, a hierarchy makes sense) Buses can be synchronous (a clock determines when each operation must happen) or asynchronous (a handshaking protocol is used to co-ordinate operations)
390
Memory-Mapped I/O Each I/O device has its own special address range
The CPU issues commands such as these: sw [some-data] [some-address] Usually, memory services these requests… if the address is in the I/O range, memory ignores it The data is written into some register in the appropriate I/O device – this serves as the command to the device
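From a C program, such a store typically looks like the sketch below (the device address is made up, and this is bare-metal style code; volatile keeps the compiler from removing or reordering the store):
#include <stdint.h>

#define DEVICE_CMD_REG ((volatile uint32_t *)0xFFFF0008)   /* hypothetical I/O address range */

void start_device(uint32_t command) {
    *DEVICE_CMD_REG = command;   /* compiles to an ordinary store (sw) aimed at the device */
}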
391
Polling Vs. Interrupt-Driven
When the I/O device is ready to respond, it can send an interrupt to the CPU; the CPU stops what it was doing; the OS examines the interrupt and then reads the data produced by the I/O device (and usually stores into memory) In the polling approach, the CPU (OS) periodically checks the status of the I/O device and if the device is ready with data, the OS reads it
392
Direct Memory Access (DMA)
Consider a disk read example: a block in disk is being read into memory For each word, the CPU does a lw [destination register] [I/O device address] and a sw [data in above register] [memory-address] This would take up too much of the CPU’s time – hence, the task is off-loaded to the DMA controller – the CPU informs the DMA of the range of addresses to be copied and the DMA lets the CPU know when it is done
393
Lecture 22: I/O, Disk Systems
Chapter 8: I/O overview, disk basics, RAID
394
Input/Output
[Diagram: CPU and cache sit on a bus that connects to memory, disk, network, USB, DVD, …]
395
I/O Hierarchy
[Diagram: CPU and cache connect to memory over the memory bus; an I/O controller bridges to a separate I/O bus that connects disk, network, USB, DVD, …]
396
Intel Example
[Diagram: Pentium 4 processor connected over an 800 MHz system bus (6.4 GB/sec) to the Memory Controller Hub (North Bridge); the North Bridge drives graphics output (2.1 GB/sec) and DDR-400 main memory (3.2 GB/sec), and links via a 266 MB/sec hub interface (each way) to the I/O Controller Hub (South Bridge), which serves 1 Gb Ethernet, Serial ATA (150 MB/s), CD/DVD and disk (100 MB/s), tape (100 MB/s), and USB 2.0 (60 MB/s).]
397
Bus Design The bus is a shared resource – any device can send
data on the bus (after first arbitrating for it) and all other devices can read this data off the bus
The address/control signals on the bus specify the intended receiver of the message
The length of the bus determines its speed (hence, a hierarchy makes sense)
Buses can be synchronous (a clock determines when each operation must happen) or asynchronous (a handshaking protocol is used to coordinate operations)
398
Memory-Mapped I/O Each I/O device has its own special address range
The CPU issues commands such as: sw [some-data] [some-address]
Usually, memory services these requests… if the address is in the I/O range, memory ignores it
The data is written into some register in the appropriate I/O device – this serves as the command to the device
399
Polling Vs. Interrupt-Driven
When the I/O device is ready to respond, it can send an interrupt to the CPU; the CPU stops what it was doing; the OS examines the interrupt and then reads the data produced by the I/O device (and usually stores it into memory)
In the polling approach, the CPU (OS) periodically checks the status of the I/O device and, if the device is ready with data, the OS reads it
400
Direct Memory Access (DMA)
Consider a disk read example: a block on disk is being read into memory
For each word, the CPU would do a lw [destination register] [I/O device address] and a sw [data in above register] [memory-address]
This would take up too much of the CPU’s time – hence, the task is off-loaded to the DMA controller – the CPU informs the DMA controller of the range of addresses to be copied and the DMA controller lets the CPU know when it is done
401
Role of I/O Activities external to the CPU are typically orders of
magnitude slower
Example: while CPU performance has improved by 50% per year, disk latencies have improved by only 10% per year
Typical strategy on I/O: switch contexts and work on something else
Other metrics, such as bandwidth, reliability, availability, and capacity, often receive more attention than performance
402
Magnetic Disks A magnetic disk consists of 1-12 platters (metal or glass disks covered with magnetic recording material on both sides), with diameters of a few inches
Each platter comprises concentric tracks (5-30K) and each track is divided into sectors (100 – 500 per track, each about 512 bytes)
A movable arm holds the read/write heads for each disk surface and moves them all in tandem – a cylinder of data is accessible at a time
403
Disk Latency To read/write data, the arm has to be placed on the
correct track – this seek time usually takes 5 to 12 ms on average – it can take less if there is spatial locality
Rotational latency is the time taken to rotate the correct sector under the head – the average is half a rotation, about 2 ms at 15,000 RPM (more for slower disks)
Transfer time is the time taken to transfer a block of bits out of the disk, typically at 3 – 65 MB/second
A disk controller maintains a disk cache (spatial locality can be exploited) and sets up the transfer on the bus (controller overhead)
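Putting the components together, a back-of-the-envelope access-time calculation might look like the following; the specific numbers (6 ms seek, 15,000 RPM, 50 MB/s transfer, 0.2 ms controller overhead) are illustrative assumptions, not measurements of any particular disk.

    #include <stdio.h>

    int main(void)
    {
        double seek_ms       = 6.0;                     /* assumed average seek        */
        double rpm           = 15000.0;
        double rotation_ms   = 60.0 * 1000.0 / rpm;     /* 4 ms per full rotation      */
        double rotational_ms = rotation_ms / 2.0;       /* average: half a rotation    */
        double block_bytes   = 4096.0;
        double transfer_mbps = 50.0;                    /* assumed transfer rate       */
        double transfer_ms   = block_bytes / (transfer_mbps * 1e6) * 1000.0;
        double controller_ms = 0.2;                     /* assumed controller overhead */

        double total_ms = seek_ms + rotational_ms + transfer_ms + controller_ms;
        printf("Average access time for a 4 KB block: %.2f ms\n", total_ms);
        /* ~6 + 2 + 0.08 + 0.2 ≈ 8.3 ms – dominated by seek and rotation */
        return 0;
    }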
404
Defining Reliability and Availability
A system toggles between two states:
service accomplishment: service matches specifications
service interruption: service deviates from specs
The toggle is caused by failures and restorations
Reliability measures continuous service accomplishment and is usually expressed as mean time to failure (MTTF)
Availability measures the fraction of time that service matches specifications, expressed as MTTF / (MTTF + MTTR)
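For example (numbers assumed for illustration): a disk with an MTTF of 1,000,000 hours and an MTTR of 24 hours has availability MTTF / (MTTF + MTTR) = 1,000,000 / 1,000,024 ≈ 0.999976, i.e., roughly 99.998% uptime.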
405
RAID Reliability and availability are important metrics for disks
RAID: redundant array of inexpensive (independent) disks
Redundancy can deal with one or more failures
Each sector of a disk records check information that allows it to determine if the disk has an error or not (in other words, redundancy already exists within a disk)
When the disk read flags an error, we turn elsewhere for correct data
406
RAID 0 and RAID 1 RAID 0 has no additional redundancy (misnomer) – it
uses an array of disks and stripes (interleaves) data across the disks to improve parallelism and throughput
RAID 1 mirrors or shadows every disk – every write happens to two disks
Reads to the mirror may happen only when the primary disk fails – or, you may try to read both together and the quicker response is accepted
Expensive solution: high reliability at twice the cost
407
RAID 3 Data is bit-interleaved across several disks and a separate
disk maintains parity information for a set of bits
For example: with 8 disks, bit 0 is in disk-0, bit 1 is in disk-1, …, bit 7 is in disk-7; disk-8 maintains parity for all 8 bits
For any read, 8 disks must be accessed (as we usually read more than a byte at a time) and for any write, 9 disks must be accessed as parity has to be re-calculated
High throughput for a single request, low cost for redundancy (overhead: 12.5%), low task-level parallelism
408
RAID 4 and RAID 5 Data is block interleaved – this allows us to get all our data from a single disk on a read – in case of a disk error, read all 9 disks
Block interleaving reduces throughput for a single request (as only a single disk drive is servicing the request), but improves task-level parallelism as other disk drives are free to service other requests
On a write, we access only the disk that stores the data and the parity disk – parity information can be updated simply by checking which bits of the new data differ from the old data (see the sketch below)
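A minimal sketch of that parity update – the "small write" trick: the new parity is the old parity XORed with whatever bits changed between the old and new data, so only two disks (data and parity) need to be read and written.

    #include <stdint.h>
    #include <stddef.h>

    /* RAID-4/5 small-write parity update for one block (block size assumed). */
    void update_parity(const uint8_t *old_data, const uint8_t *new_data,
                       uint8_t *parity, size_t block_bytes)
    {
        for (size_t i = 0; i < block_bytes; i++) {
            /* bits that changed in the data block */
            uint8_t delta = old_data[i] ^ new_data[i];
            /* flip exactly those bits in the parity block */
            parity[i] ^= delta;
        }
    }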
409
RAID 5 If we have a single disk for parity, multiple writes can not
happen in parallel (as all writes must update the parity info)
RAID 5 distributes the parity blocks across all disks to allow simultaneous writes
410
RAID Summary RAID 1-5 can tolerate a single fault – mirroring (RAID 1)
has a 100% overhead, while parity (RAID 3, 4, 5) has modest overhead
Can tolerate multiple faults by having multiple check functions – each additional check can cost an additional disk (RAID 6)
RAID 6 and RAID 2 (memory-style ECC) are not commercially employed
411
I/O Performance Throughput (bandwidth) and response times (latency)
are the key performance metrics for I/O
The description of the hardware characterizes maximum throughput and average response time (usually with no queueing delays)
The description of the workload characterizes the “real” throughput – corresponding to this throughput is an average response time
412
Throughput Vs. Response Time
As load increases, throughput increases (as utilization is high) – simultaneously, response times also go up because the probability of having to wait for service goes up: a trade-off between throughput and response time
In systems involving human interaction, there are three relevant delays: data entry time, system response time, and think time – studies have shown that improvements in response time also result in improvements in think time → better response time and much better throughput
Most benchmark suites try to determine throughput while placing a restriction on response times
413
Lecture 23: Multiprocessors
Chapter 9: RAID, multiprocessor taxonomy, snooping-based cache coherence protocol
414
RAID 0 and RAID 1 RAID 0 has no additional redundancy (misnomer) – it
uses an array of disks and stripes (interleaves) data across the disks to improve parallelism and throughput
RAID 1 mirrors or shadows every disk – every write happens to two disks
Reads to the mirror may happen only when the primary disk fails – or, you may try to read both together and the quicker response is accepted
Expensive solution: high reliability at twice the cost
415
RAID 3 Data is bit-interleaved across several disks and a separate
disk maintains parity information for a set of bits
For example: with 8 disks, bit 0 is in disk-0, bit 1 is in disk-1, …, bit 7 is in disk-7; disk-8 maintains parity for all 8 bits
For any read, 8 disks must be accessed (as we usually read more than a byte at a time) and for any write, 9 disks must be accessed as parity has to be re-calculated
High throughput for a single request, low cost for redundancy (overhead: 12.5%), low task-level parallelism
416
RAID 4 and RAID 5 Data is block interleaved – this allows us to get all our data from a single disk on a read – in case of a disk error, read all 9 disks
Block interleaving reduces throughput for a single request (as only a single disk drive is servicing the request), but improves task-level parallelism as other disk drives are free to service other requests
On a write, we access only the disk that stores the data and the parity disk – parity information can be updated simply by checking which bits of the new data differ from the old data
417
RAID 5 If we have a single disk for parity, multiple writes can not
happen in parallel (as all writes must update the parity info)
RAID 5 distributes the parity blocks across all disks to allow simultaneous writes
418
RAID Summary RAID 1-5 can tolerate a single fault – mirroring (RAID 1)
has a 100% overhead, while parity (RAID 3, 4, 5) has modest overhead
Can tolerate multiple faults by having multiple check functions – each additional check can cost an additional disk (RAID 6)
RAID 6 and RAID 2 (memory-style ECC) are not commercially employed
419
Multiprocessor Taxonomy
SISD: single instruction and single data stream: uniprocessor
MISD: no commercial multiprocessor: imagine data going through a pipeline of execution engines
SIMD: vector architectures: lower flexibility
MIMD: most multiprocessors today: easy to construct with off-the-shelf computers, most flexibility
420
Memory Organization - I
Centralized shared-memory multiprocessor or symmetric shared-memory multiprocessor (SMP)
Multiple processors connected to a single centralized memory – since all processors see the same memory organization → uniform memory access (UMA)
Shared-memory because all processors can access the entire memory address space
Can centralized memory emerge as a bandwidth bottleneck? – not if you have large caches and employ fewer than a dozen processors
421
SMPs or Centralized Shared-Memory
[Diagram: four processors, each with its own caches, share a bus to main memory and the I/O system]
422
Memory Organization - II
For higher scalability, memory is distributed among processors → distributed memory multiprocessors
If one processor can directly address the memory local to another processor, the address space is shared → distributed shared-memory (DSM) multiprocessor
If memories are strictly local, we need messages to communicate data → cluster of computers or multicomputers
Non-uniform memory architecture (NUMA) since local memory has lower latency than remote memory
423
Distributed Memory Multiprocessors
[Diagram: four nodes, each with a processor & caches plus local memory and I/O, connected by an interconnection network]
424
SMPs Centralized main memory and many caches → many copies of the same data
A system is cache coherent if a read returns the most recently written value for that word
[Table (example trace): Time 1 – CPU-A reads X; Time 2 – CPU-B reads X; Time 3 – CPU-A stores 0 in X; the columns track the value of X in Cache-A, Cache-B, and memory, showing that Cache-B is left with a stale copy after the store]
425
Cache Coherence A memory system is coherent if:
P writes to X; no other processor writes to X; P reads X and receives the value previously written by P
P1 writes to X; no other processor writes to X; sufficient time elapses; P2 reads X and receives the value written by P1
Two writes to the same location by two processors are seen in the same order by all processors – write serialization
The memory consistency model defines the “time elapsed” before the effect of a processor is seen by others
426
Cache Coherence Protocols
Directory-based: a single location (directory) keeps track of the sharing status of a block of memory
Snooping: every cache block is accompanied by the sharing status of that block – all cache controllers monitor the shared bus so they can update the sharing status of the block, if necessary
Write-invalidate: a processor gains exclusive access to a block before writing by invalidating all other copies
Write-update: when a processor writes, it updates other shared copies of that block
427
Design Issues Three states for a block: invalid, shared, modified
A write is placed on the bus and sharers invalidate themselves
[Diagram: four processors, each with its own caches, on a shared bus to main memory and the I/O system]
428
Lecture 24: Multiprocessors
Directory-based cache coherence protocol Synchronization Consistency Writing parallel programs
429
Snooping-Based Protocols
Three states for a block: invalid, shared, modified
A write is placed on the bus and sharers invalidate themselves
The protocols are referred to as MSI, MESI, etc.
[Diagram: four processors, each with its own caches, on a shared bus to main memory and the I/O system]
430
Example
P1 reads X: not found in cache-1, request sent on the bus, memory responds, X is placed in cache-1 in shared state
P2 reads X: not found in cache-2, request sent on the bus, everyone snoops this request, cache-1 does nothing because this is just a read request, memory responds, X is placed in cache-2 in shared state
P1 writes X: cache-1 has the data in shared state (shared only provides read permission), request sent on the bus, cache-2 snoops it and invalidates its copy of X, cache-1 moves its state to modified
P2 reads X: cache-2 has the data in invalid state, request sent on the bus, cache-1 snoops and realizes it has the only valid copy, so it downgrades itself to shared state and responds with the data, X is placed in cache-2 in shared state
[Diagram: P1 and P2 with Cache-1 and Cache-2 on a shared bus to main memory]
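A compressed sketch of the cache-side MSI state machine that drives the example above; this is a simplified model (no write-back of dirty data shown, no MESI exclusive state), not a full protocol implementation.

    typedef enum { INVALID, SHARED, MODIFIED } MsiState;
    typedef enum { CPU_READ, CPU_WRITE, BUS_READ, BUS_WRITE } MsiEvent;

    /* Next state for one cache block, given its current state and an event.
       CPU_* events come from this core; BUS_* events are snooped requests
       issued by other cores. */
    MsiState msi_next(MsiState s, MsiEvent e)
    {
        switch (s) {
        case INVALID:
            if (e == CPU_READ)  return SHARED;    /* miss: fetch, read-only copy       */
            if (e == CPU_WRITE) return MODIFIED;  /* miss: fetch and invalidate others */
            return INVALID;
        case SHARED:
            if (e == CPU_WRITE) return MODIFIED;  /* upgrade: broadcast an invalidate  */
            if (e == BUS_WRITE) return INVALID;   /* another core wants to write       */
            return SHARED;                        /* reads keep it shared              */
        case MODIFIED:
            if (e == BUS_READ)  return SHARED;    /* downgrade and supply the data     */
            if (e == BUS_WRITE) return INVALID;   /* another writer takes ownership    */
            return MODIFIED;
        }
        return INVALID;
    }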
431
Cache Coherence Protocols
Directory-based: a single location (directory) keeps track of the sharing status of a block of memory
Snooping: every cache block is accompanied by the sharing status of that block – all cache controllers monitor the shared bus so they can update the sharing status of the block, if necessary
Write-invalidate: a processor gains exclusive access to a block before writing by invalidating all other copies
Write-update: when a processor writes, it updates other shared copies of that block
432
Coherence in Distributed Memory Multiprocs
Distributed memory systems are typically larger → bus-based snooping may not work well
Option 1: software-based mechanisms – message-passing systems or software-controlled cache coherence
Option 2: hardware-based mechanisms – directory-based cache coherence
433
Distributed Memory Multiprocessors
[Diagram: four nodes, each with a processor & caches, local memory and I/O, and a directory, connected by an interconnection network]
434
Directory-Based Cache Coherence
The physical memory is distributed among all processors
The directory is also distributed along with the corresponding memory
The physical address is enough to determine the location of memory
The (many) processing nodes are connected with a scalable interconnect (not a bus) – hence, messages are no longer broadcast, but routed from sender to receiver – since the processing nodes can no longer snoop, the directory keeps track of sharing state
435
Cache Block States What are the different states a block of memory can have within the directory?
Note that we need information for each cache so that invalidate messages can be sent
The directory now serves as the arbitrator: if multiple write attempts happen simultaneously, the directory determines the ordering
436
Directory-Based Example
Access sequence: A: Rd X; B: Rd X; C: Rd X; A: Wr X; C: Wr X; A: Rd Y; B: Wr X; B: Rd Y; B: Wr Y
[Diagram: three nodes (each with a processor & caches, memory, I/O, and a directory) connected by an interconnection network; block X is homed at the second node's directory and block Y at the third's]
437
Directory Actions
If the block is in the uncached state:
Read miss: send data, make block shared
Write miss: send data, make block exclusive
If the block is in the shared state:
Read miss: send data, add node to sharers list
Write miss: send data, invalidate sharers, make exclusive
If the block is in the exclusive state:
Read miss: ask owner for data, write to memory, send data, make shared, add node to sharers list
Data write-back: write to memory, make uncached
Write miss: ask owner for data, write to memory, send data, update identity of new owner, remain exclusive
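A toy directory entry and handler for the read-miss cases listed above; a real directory would also track requesting/owning node IDs per message and handle write-backs and protocol races, so treat this as a sketch only.

    #include <stdbool.h>

    #define MAX_NODES 64

    typedef enum { UNCACHED, SHARED_ST, EXCLUSIVE_ST } DirState;

    typedef struct {
        DirState state;
        bool     sharer[MAX_NODES];   /* which nodes hold a copy    */
        int      owner;               /* valid only in EXCLUSIVE_ST */
    } DirEntry;

    /* Handle a read miss from 'node' for the block tracked by 'd'.
       The coherence messages (send data, fetch from owner, ...) are
       indicated by comments. */
    void directory_read_miss(DirEntry *d, int node)
    {
        switch (d->state) {
        case UNCACHED:
            /* send data from memory to 'node' */
            d->sharer[node] = true;
            d->state = SHARED_ST;
            break;
        case SHARED_ST:
            /* send data from memory to 'node' */
            d->sharer[node] = true;
            break;
        case EXCLUSIVE_ST:
            /* ask the current owner for the data, write it back to memory,
               then forward it to 'node'; both now share the block */
            d->sharer[d->owner] = true;
            d->sharer[node]     = true;
            d->state = SHARED_ST;
            break;
        }
    }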
438
Constructing Locks Applications have phases (consisting of many instructions) that must be executed atomically, without other parallel processes modifying the data
A lock surrounding the data/code ensures that only one program can be in a critical section at a time
The hardware must provide some basic primitives that allow us to construct locks with different properties
Example: bank balance $1000, two parallel (unlocked) deposit transactions:
Thread 1: Rd $1000; Add $100; Wr $1100
Thread 2: Rd $1000; Add $200; Wr $1200
Both read the old balance, so one of the two updates is lost
439
Synchronization The simplest hardware primitive that greatly facilitates synchronization implementations (locks, barriers, etc.) is an atomic read-modify-write
Atomic exchange: swap the contents of a register and a memory location
Special case of atomic exchange: test & set – transfer the memory location into a register and write 1 into memory (if memory has 0, the lock is free)
  lock:  t&s  register, location
         bnz  register, lock
         CS
         st   location, #0
When multiple parallel threads execute this code, only one will be able to enter CS
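In C, the same spin lock can be sketched with a compiler-provided atomic exchange; __atomic_exchange_n and __atomic_store_n are GCC/Clang built-ins, used here as a stand-in for the t&s instruction.

    /* value 0 = free, 1 = held */
    typedef volatile int spinlock_t;

    void lock_acquire(spinlock_t *lock)
    {
        /* Atomically write 1 and get the old value back: this is test & set.
           Keep spinning as long as the old value was already 1 (lock held). */
        while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE) != 0)
            ;   /* busy-wait */
    }

    void lock_release(spinlock_t *lock)
    {
        __atomic_store_n(lock, 0, __ATOMIC_RELEASE);   /* st location, #0 */
    }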
440
Coherence Vs. Consistency
Recall that coherence guarantees (i) write propagation (a write will eventually be seen by other processors), and (ii) write serialization (all processors see writes to the same location in the same order) The consistency model defines the ordering of writes and reads to different memory locations – the hardware guarantees a certain consistency model and the programmer attempts to write correct programs with those assumptions
441
Consistency Example Consider a multiprocessor with bus-based snooping cache coherence and a write buffer between CPU and cache
Initially A = B = 0
  P1: A = 1;           P2: B = 1;
      if (B == 0)          if (A == 0)
         Crit.Section          Crit.Section
The programmer expected the above code to implement a lock – because of write buffering, both processors can enter the critical section
The consistency model lets the programmer know what assumptions they can make about the hardware’s reordering capabilities
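For reference, a hedged sketch of the same idiom using C11 atomics; with the default sequentially consistent operations, the compiler and hardware may not let both threads slip into the critical section, which is exactly the guarantee the plain-store version above lacks.

    #include <stdatomic.h>

    atomic_int A, B;          /* both initially 0 */

    void thread1(void)
    {
        atomic_store(&A, 1);               /* seq_cst store */
        if (atomic_load(&B) == 0) {        /* seq_cst load  */
            /* critical section: at most one thread can reach its section */
        }
    }

    void thread2(void)
    {
        atomic_store(&B, 1);
        if (atomic_load(&A) == 0) {
            /* critical section */
        }
    }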
442
Sequential Consistency
A multiprocessor is sequentially consistent if the result of the execution is achievable by maintaining program order within a processor and interleaving accesses by different processors in an arbitrary fashion
The multiprocessor in the previous example is not sequentially consistent
We can implement sequential consistency by requiring the following: program order, write serialization, and that everyone has seen an update before a value is read – very intuitive for the programmer, but extremely slow
443
Shared-Memory Vs. Message-Passing
Shared-memory:
Well-understood programming model
Communication is implicit and hardware handles protection
Hardware-controlled caching
Message-passing:
No cache coherence → simpler hardware
Explicit communication → easier for the programmer to restructure code
Software-controlled caching
Sender can initiate data transfer
444
Ocean Kernel
  procedure Solve(A)
  begin
    diff = done = 0;
    while (!done) do
      diff = 0;
      for i ← 1 to n do
        for j ← 1 to n do
          temp = A[i,j];
          A[i,j] ← 0.2 * (A[i,j] + neighbors);
          diff += abs(A[i,j] – temp);
        end for
      end for
      if (diff < TOL) then done = 1;
    end while
  end procedure
445
Shared Address Space Model
  int n, nprocs;
  float **A, diff;
  LOCKDEC(diff_lock);
  BARDEC(bar1);

  main()
  begin
    read(n); read(nprocs);
    A ← G_MALLOC();
    initialize(A);
    CREATE(nprocs, Solve, A);
    WAIT_FOR_END(nprocs);
  end main

  procedure Solve(A)
    int i, j, pid, done = 0;
    float temp, mydiff = 0;
    int mymin = 1 + (pid * n/nprocs);
    int mymax = mymin + n/nprocs - 1;
    while (!done) do
      mydiff = diff = 0;
      BARRIER(bar1, nprocs);
      for i ← mymin to mymax
        for j ← 1 to n do
          …
        endfor
      endfor
      LOCK(diff_lock);
      diff += mydiff;
      UNLOCK(diff_lock);
      BARRIER(bar1, nprocs);
      if (diff < TOL) then done = 1;
    endwhile
446
Message Passing Model
  main()
    read(n); read(nprocs);
    CREATE(nprocs-1, Solve);
    Solve();
    WAIT_FOR_END(nprocs-1);

  procedure Solve()
    int i, j, pid, nn = n/nprocs, done = 0;
    float temp, tempdiff, mydiff = 0;
    myA ← malloc(…);
    initialize(myA);
    while (!done) do
      mydiff = 0;
      if (pid != 0) SEND(&myA[1,0], n, pid-1, ROW);
      if (pid != nprocs-1) SEND(&myA[nn,0], n, pid+1, ROW);
      RECEIVE(&myA[0,0], n, pid-1, ROW);
      RECEIVE(&myA[nn+1,0], n, pid+1, ROW);
      for i ← 1 to nn do
        for j ← 1 to n do
          …
        endfor
      endfor
      if (pid != 0)
        SEND(mydiff, 1, 0, DIFF);
        RECEIVE(done, 1, 0, DONE);
      else
        for i ← 1 to nprocs-1 do
          RECEIVE(tempdiff, 1, *, DIFF);
          mydiff += tempdiff;
        endfor
        if (mydiff < TOL) done = 1;
        for i ← 1 to nprocs-1 do
          SEND(done, 1, i, DONE);
        endfor
      endif
    endwhile
447
Lecture 25: Multi-core Processors
Writing parallel programs SMT Multi-core examples
448
Shared-Memory Vs. Message-Passing
Shared-memory:
Well-understood programming model
Communication is implicit and hardware handles protection
Hardware-controlled caching
Message-passing:
No cache coherence → simpler hardware
Explicit communication → easier for the programmer to restructure code
Software-controlled caching
Sender can initiate data transfer
449
Ocean Kernel
[Diagram: the grid’s rows are partitioned into contiguous chunks – rows 1..k, k+1..2k, 2k+1..3k, … – one chunk per process]
  procedure Solve(A)
  begin
    diff = done = 0;
    while (!done) do
      diff = 0;
      for i ← 1 to n do
        for j ← 1 to n do
          temp = A[i,j];
          A[i,j] ← 0.2 * (A[i,j] + neighbors);
          diff += abs(A[i,j] – temp);
        end for
      end for
      if (diff < TOL) then done = 1;
    end while
  end procedure
450
Shared Address Space Model
  int n, nprocs;
  float **A, diff;
  LOCKDEC(diff_lock);
  BARDEC(bar1);

  main()
  begin
    read(n); read(nprocs);
    A ← G_MALLOC();
    initialize(A);
    CREATE(nprocs, Solve, A);
    WAIT_FOR_END(nprocs);
  end main

  procedure Solve(A)
    int i, j, pid, done = 0;
    float temp, mydiff = 0;
    int mymin = 1 + (pid * n/nprocs);
    int mymax = mymin + n/nprocs - 1;
    while (!done) do
      mydiff = diff = 0;
      BARRIER(bar1, nprocs);
      for i ← mymin to mymax
        for j ← 1 to n do
          …
        endfor
      endfor
      LOCK(diff_lock);
      diff += mydiff;
      UNLOCK(diff_lock);
      BARRIER(bar1, nprocs);
      if (diff < TOL) then done = 1;
    endwhile
451
Message Passing Model
  main()
    read(n); read(nprocs);
    CREATE(nprocs-1, Solve);
    Solve();
    WAIT_FOR_END(nprocs-1);

  procedure Solve()
    int i, j, pid, nn = n/nprocs, done = 0;
    float temp, tempdiff, mydiff = 0;
    myA ← malloc(…);
    initialize(myA);
    while (!done) do
      mydiff = 0;
      if (pid != 0) SEND(&myA[1,0], n, pid-1, ROW);
      if (pid != nprocs-1) SEND(&myA[nn,0], n, pid+1, ROW);
      RECEIVE(&myA[0,0], n, pid-1, ROW);
      RECEIVE(&myA[nn+1,0], n, pid+1, ROW);
      for i ← 1 to nn do
        for j ← 1 to n do
          …
        endfor
      endfor
      if (pid != 0)
        SEND(mydiff, 1, 0, DIFF);
        RECEIVE(done, 1, 0, DONE);
      else
        for i ← 1 to nprocs-1 do
          RECEIVE(tempdiff, 1, *, DIFF);
          mydiff += tempdiff;
        endfor
        if (mydiff < TOL) done = 1;
        for i ← 1 to nprocs-1 do
          SEND(done, 1, i, DONE);
        endfor
      endif
    endwhile
452
Multithreading Within a Processor
Until now, we have executed multiple threads of an application on different processors – can multiple threads execute concurrently on the same processor?
Why is this desirable?
inexpensive – one CPU, no external interconnects
no remote or coherence misses (more capacity misses)
Why does this make sense?
most processors can’t find enough work – peak IPC is 6, average IPC is 1.5!
threads can share resources → we can increase the number of threads without a corresponding linear increase in area
453
How are Resources Shared?
[Diagram: each box is an issue slot for a functional unit; peak throughput is 4 IPC. Three panels show how the slots fill over cycles for threads 1–4 (plus idle slots): superscalar, fine-grained multithreading, and simultaneous multithreading.]
A superscalar processor has high under-utilization – not enough work every cycle, especially when there is a cache miss
Fine-grained multithreading can only issue instructions from a single thread in a cycle – it cannot find maximum work every cycle, but cache misses can be tolerated
Simultaneous multithreading can issue instructions from any thread every cycle – it has the highest probability of finding work for every issue slot
454
Performance Implications of SMT
Single-thread performance is likely to go down (caches, branch predictors, registers, etc. are shared) – this effect can be mitigated by trying to prioritize one thread
With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4
455
Pentium4: Hyper-Threading
Two threads – the Linux operating system operates as if it is executing on a two-processor system
When there is only one available thread, it behaves like a regular single-threaded superscalar processor
456
Multi-Programmed Speedup
457
Why Multi-Cores? New constraints: power, temperature, complexity
Because of the above, we can’t introduce complex techniques to improve single-thread performance
Most of the low-hanging fruit for single-thread performance has been picked
Hence, additional transistors have the biggest impact on throughput if they are used to execute multiple threads
… this assumes that most users will run multi-threaded applications
458
Efficient Use of Transistors
Transistors can be used for:
cache hierarchies
number of cores
multi-threading within a core (SMT)
Should we simplify cores so we have more available transistors?
[Diagram: a multi-core floorplan of cores and cache banks]
459
Design Space Exploration
[Figure from Davis et al., PACT 2005: design-space exploration where p = scalar pipelines, t = threads, s = superscalar pipelines]
460
Case Study I: Sun’s Niagara
Commercial servers require high thread-level throughput and suffer from cache misses
Sun’s Niagara focuses on:
simple cores (low power, low design complexity, can accommodate more cores)
fine-grain multi-threading (to tolerate long memory latencies)
461
Niagara Overview
462
SPARC Pipe No branch predictor Low clock speed (1.2 GHz)
One FP unit shared by all cores
463
Case Study II: Intel Core Architecture
Single-thread execution is still considered important → out-of-order execution and speculation are very much alive → initial processors will have few heavy-weight cores
To reduce power consumption, the Core architecture (14 pipeline stages) is closer to the Pentium M (12 stages) than to the P4 (30 stages)
Many transistors are invested in a large branch predictor to reduce wasted work (power)
Similarly, SMT is not guaranteed for all incarnations of the Core architecture (SMT makes a hotspot hotter)
464
Cache Organizations for Multi-cores
L1 caches are always private to a core
L2 caches can be private or shared – which is better?
[Diagram: two organizations for four cores P1–P4, each with a private L1: in one, each core has its own private L2; in the other, all cores share a single L2]
465
Cache Organizations for Multi-cores
L1 caches are always private to a core
L2 caches can be private or shared
Advantages of a shared L2 cache:
efficient dynamic allocation of space to each core
data shared by multiple cores is not replicated
every block has a fixed “home” – hence, it is easy to find the latest copy
Advantages of a private L2 cache:
quick access to the private L2 – good for small working sets
private bus to the private L2 → less contention