Chapter :1 Introduction

Chapter :1 Introduction
logistics why computer organization is important modern trends

Why Computer Organization
Yes, I know, required class…

Why Computer Organization
Embarrassing if you are a BS in CS/CE and can’t make sense of the following terms: DRAM, pipelining, cache hierarchies, I/O, virtual memory Embarrassing if you are a BS in CS/CE and can’t decide which processor to buy: 3 GHz P4 or 2.5 GHz Athlon (helps us reason about performance/power) Obvious first step for chip designers, compiler/OS writers Will knowledge of the hardware help me write better programs?

Must a Programmer Care About Hardware?
Memory management: if we understand how/where data is placed, we can help ensure that relevant data is nearby Thread management: if we understand how threads interact, we can write smarter multi-threaded programs  Why do we care about multi-threaded programs?

Microprocessor Performance
50% improvement every year!! What contributes to this improvement?

Modern Trends Historical contributions to performance:
Better processes (faster devices) ~20% Better circuits/pipelines ~15% Better organization/architecture ~15% In the future, bullet-2 will help little and bullet-3 will not help much for a single core! Pentium P-Pro P-II P-III P Itanium Montecito Year Transistors M M 7.5M 9.5M 42M M M Clock Speed M M 300M 500M M M M Moore’s Law in action At this point, adding transistors to a core yields little benefit

What Does This Mean to a Programmer?
In the past, a new chip directly meant 50% higher performance for a program Today, one can expect only a 20% improvement, unless… the program can be broken up into multiple threads Expect #threads to emerge as a major metric for software quality 4-way multi-core 8-way multi-core

Challenges for the Hardware Designers
Major concerns: The performance problem (especially scientific workloads) The power dissipation problem (especially embedded processors) The temperature problem The reliability problem

The HW/SW Interface a[i] = b[i] + c; Application software Compiler
lw $15, 0($2) add $16, $15, $14 add $17, $15, $13 lw $18, 0($12) lw $19, 0($17) add $20, $18, $19 sw $20, 0($16) Systems software (OS, compiler) Assembler Hardware …

Computer Components Input/output devices
Secondary storage: non-volatile, slower, cheaper Primary storage: volatile, faster, costlier CPU/processor

Wafers and Dies

Manufacturing Process
Silicon wafers undergo many processing steps so that different parts of the wafer behave as insulators, conductors, and transistors (switches) Multiple metal layers on the silicon enable connections between transistors The wafer is chopped into many dies – the size of the die determines yield and cost

Processor Technology Trends
Shrinking of transistor sizes: 250nm (1997)  130nm (2002)  70nm (2008)  35nm (2014) Transistor density increases by 35% per year and die size increases by 10-20% per year… functionality improvements! Transistor speed improves linearly with size (complex equation involving voltages, resistances, capacitances) Wire delays do not scale down at the same rate as transistor delays

Memory and I/O Technology Trends
DRAM density increases by 40-60% per year, latency has reduced by 33% in 10 years (the memory wall!), bandwidth improves twice as fast as latency decreases Disk density improves by 100% every year, latency improvement similar to DRAM Networks: primary focus on bandwidth; 10Mb  100Mb in 10 years; 100Mb  1Gb in 5 years

Power Consumption Trends
Dyn power  activity x capacitance x voltage2 x frequency Capacitance per transistor and voltage are decreasing, but number of transistors and frequency are increasing at a faster rate Leakage power is also rising and will soon match dynamic power Power consumption is already around 100W in some high-performance processors today

Next Class Topics: MIPS instruction set architecture (Chapter 2)
Visit the class web-page Sign up for the mailing list Pick up CADE Lab passwords

Lectuure : 1 MIPS Instruction Set
Chapter : 2 Lectuure : 1 MIPS Instruction Set MIPS instructions

Recap Knowledge of hardware improves software quality:
compilers, OS, threaded programs, memory management Important trends: growing transistors, move to multi-core, slowing rate of performance improvement, power/thermal constraints, long memory/disk latencies

Instruction Set Understanding the language of the hardware is key to understanding the hardware/software interface A program (in say, C) is compiled into an executable that is composed of machine instructions – this executable must also run on future machines – for example, each Intel processor reads in the same x86 instructions, but each processor handles instructions differently Java programs are converted into portable bytecode that is converted into machine instructions during execution (just-in-time compilation) What are important design principles when defining the instruction set architecture (ISA)?

Instruction Set Important design principles when defining the
instruction set architecture (ISA): keep the hardware simple – the chip must only implement basic primitives and run fast keep the instructions regular – simplifies the decoding/scheduling of instructions

A Basic MIPS Instruction
C code: a = b + c ; Assembly code: (human-friendly machine instructions) add a, b, c # a is the sum of b and c Machine code: (hardware-friendly machine instructions) Translate the following C code into assembly code: a = b + c + d + e;

Example C code a = b + c + d + e;
translates into the following assembly code: add a, b, c add a, b, c add a, a, d or add f, d, e add a, a, e add a, a, f Instructions are simple: fixed number of operands (unlike C) A single line of C code is converted into multiple lines of assembly code Some sequences are better than others… the second sequence needs one more (temporary) variable f

Subtract Example C code f = (g + h) – (i + j);
Assembly code translation with only add and sub instructions:

Subtract Example C code f = (g + h) – (i + j);
translates into the following assembly code: add t0, g, h add f, g, h add t1, i, j or sub f, f, i sub f, t0, t sub f, f, j Each version may produce a different result because floating-point operations are not necessarily associative and commutative… more on this later

Operands In C, each “variable” is a location in memory
In hardware, each memory access is expensive – if variable a is accessed repeatedly, it helps to bring the variable into an on-chip scratchpad and operate on the scratchpad (registers) To simplify the instructions, we require that each instruction (add, sub) only operate on registers Note: the number of operands (variables) in a C program is very large; the number of operands in assembly is fixed… there can be only so many scratchpad registers

Registers The MIPS ISA has 32 registers (x86 has 8 registers) –
Why not more? Why not less? Each register is 32-bit wide (modern 64-bit architectures have 64-bit wide registers) A 32-bit entity (4 bytes) is referred to as a word To make the code more readable, registers are partitioned as $s0-$s7 (C/Java variables), $t0-$t9 (temporary variables)…

Memory Operands Values must be fetched from memory before (add and sub) instructions can operate on them Load word lw $t0, memory-address Store word sw $t0, memory-address How is memory-address determined? Memory Register Memory Register

… Memory Address The compiler organizes data in memory… it knows the
location of every variable (saved in a table)… it can fill in the appropriate mem-address for load-store instructions int a, b, c, d[10] … Memory Base address

Immediate Operands An instruction may require a constant as input
An immediate instruction uses a constant number as one of the inputs (instead of a register operand) addi $s0, $zero, # the program has base address # and this is saved in $s0 # $zero is a register that always # equals zero addi $s1, $s0, # this is the address of variable a addi $s2, $s0, # this is the address of variable b addi $s3, $s0, # this is the address of variable c addi $s4, $s0, # this is the address of variable d[0]

Memory Instruction Format
The format of a load instruction: destination register source address lw $t0, 8($t3) any register a constant that is added to the register in brackets

Example Convert to assembly: C code: d[3] = d[2] + a;

Example Convert to assembly: C code: d[3] = d[2] + a;
Assembly: # addi instructions as before lw $t0, 8($s4) # d[2] is brought into $t0 lw $t1, 0($s1) # a is brought into $t1 add $t0, $t0, $t1 # the sum is in $t0 sw $t0, 12($s4) # $t0 is stored into d[3] Assembly version of the code continues to expand!

Recap – Numeric Representations
Decimal Binary Hexadecimal (compact representation) 0x or 23hex 0-15 (decimal)  0-9, a-f (hex)

Instruction Formats Instructions are represented as 32-bit numbers (one word), broken into 6 fields R-type instruction add $t0, $s1, $s2 6 bits bits bits bits bits bits op rs rt rd shamt funct opcode source source dest shift amt function I-type instruction lw $t0, 32($s3) 6 bits bits 5 bits bits opcode rs rt constant

Logical Operations Logical ops C operators Java operators MIPS instr
Shift Left << << sll Shift Right >> >>> srl Bit-by-bit AND & & and, andi Bit-by-bit OR | | or, ori Bit-by-bit NOT ~ ~ nor

Control Instructions Conditional branch: Jump to instruction L1 if register1 equals register2: beq register1, register2, L1 Similarly, bne and slt (set-on-less-than) Unconditional branch: j L1 jr $s0 Convert to assembly: if (i == j) f = g+h; else f = g-h;

Control Instructions Conditional branch: Jump to instruction L1 if register1 equals register2: beq register1, register2, L1 Similarly, bne and slt (set-on-less-than) Unconditional branch: j L1 jr $s0 Convert to assembly: if (i == j) bne $s3, $s4, Else f = g+h; add $s0, $s1, $s2 else j Exit f = g-h; Else: sub $s0, $s1, $s2 Exit:

Example Convert to assembly: while (save[i] == k) i += 1;
i and k are in $s3 and $s5 and base of array save[] is in $s6

Example Convert to assembly: while (save[i] == k)
i and k are in $s3 and $s5 and base of array save[] is in $s6 Loop: sll $t1, $s3, 2 add $t1, $t1, $s6 lw $t0, 0($t1) bne $t0, $s5, Exit addi $s3, $s3, 1 j Loop Exit:

Lecture 3: MIPS Instruction Set
More MIPS instructions Procedure call/return

Memory Operands Values must be fetched from memory before (add and sub) instructions can operate on them Load word lw $t0, memory-address Store word sw $t0, memory-address How is memory-address determined? Memory Register Memory Register

… Memory Address The compiler organizes data in memory… it knows the
location of every variable (saved in a table)… it can fill in the appropriate mem-address for load-store instructions int a, b, c, d[10] … Memory Base address

Immediate Operands An instruction may require a constant as input
An immediate instruction uses a constant number as one of the inputs (instead of a register operand) addi $s0, $zero, # the program has base address # and this is saved in $s0 # $zero is a register that always # equals zero addi $s1, $s0, # this is the address of variable a addi $s2, $s0, # this is the address of variable b addi $s3, $s0, # this is the address of variable c addi $s4, $s0, # this is the address of variable d[0]

Memory Instruction Format
The format of a load instruction: destination register source address lw $t0, 8($t3) any register a constant that is added to the register in brackets

Example Convert to assembly: C code: d[3] = d[2] + a;
Assembly: # addi instructions as before lw $t0, 8($s4) # d[2] is brought into $t0 lw $t1, 0($s1) # a is brought into $t1 add $t0, $t0, $t1 # the sum is in $t0 sw $t0, 12($s4) # $t0 is stored into d[3] Assembly version of the code continues to expand!

Recap – Numeric Representations
Decimal = 3 x x 100 Binary = 1 x x x 20 Hexadecimal (compact representation) 0x or 23hex = 2 x x 160 0-15 (decimal)  0-9, a-f (hex) Dec Binary Hex Dec Binary Hex Dec Binary Hex a b Dec Binary Hex c d e f

Instruction Formats Instructions are represented as 32-bit numbers (one word), broken into 6 fields R-type instruction add $t0, $s1, $s2 6 bits bits bits bits bits bits op rs rt rd shamt funct opcode source source dest shift amt function I-type instruction lw $t0, 32($s3) 6 bits bits 5 bits bits opcode rs rt constant

Logical Operations Logical ops C operators Java operators MIPS instr
Shift Left << << sll Shift Right >> >>> srl Bit-by-bit AND & & and, andi Bit-by-bit OR | | or, ori Bit-by-bit NOT ~ ~ nor

Control Instructions Conditional branch: Jump to instruction L1 if register1 equals register2: beq register1, register2, L1 Similarly, bne and slt (set-on-less-than) Unconditional branch: j L1 jr $s0 (useful for large case statements and big jumps) Convert to assembly: if (i == j) f = g+h; else f = g-h;

Control Instructions Conditional branch: Jump to instruction L1 if register1 equals register2: beq register1, register2, L1 Similarly, bne and slt (set-on-less-than) Unconditional branch: j L1 jr $s0 (useful for large case statements and big jumps) Convert to assembly: if (i == j) bne $s3, $s4, Else f = g+h; add $s0, $s1, $s2 else j Exit f = g-h; Else: sub $s0, $s1, $s2 Exit:

Example Convert to assembly: while (save[i] == k) i += 1;
i and k are in $s3 and $s5 and base of array save[] is in $s6

Example Convert to assembly: while (save[i] == k)
i and k are in $s3 and $s5 and base of array save[] is in $s6 Loop: sll $t1, $s3, 2 add $t1, $t1, $s6 lw $t0, 0($t1) bne $t0, $s5, Exit addi $s3, $s3, 1 j Loop Exit:

Procedures Each procedure (function, subroutine) maintains a scratchpad of register values – when another procedure is called (the callee), the new procedure takes over the scratchpad – values may have to be saved so we can safely return to the caller parameters (arguments) are placed where the callee can see them control is transferred to the callee acquire storage resources for callee execute the procedure place result value where caller can access it return control to caller

Registers The 32 MIPS registers are partitioned as follows:
Register 0 : $zero always stores the constant 0 Regs : $v0, $v1 return values of a procedure Regs : $a0-$a3 input arguments to a procedure Regs 8-15 : $t0-$t7 temporaries Regs 16-23: $s0-$s7 variables Regs 24-25: $t8-$t9 more temporaries Reg : $gp global pointer Reg : $sp stack pointer Reg : $fp frame pointer Reg : $ra return address

Jump-and-Link A special register (storage not part of the register file) maintains the address of the instruction currently being executed – this is the program counter (PC) The procedure call is executed by invoking the jump-and-link (jal) instruction – the current PC (actually, PC+4) is saved in the register $ra and we jump to the procedure’s address (the PC is accordingly set to this address) jal NewProcedureAddress Since jal may over-write a relevant value in $ra, it must be saved somewhere (in memory?) before invoking the jal instruction How do we return control back to the caller after completing the callee procedure?

… The Stack The register scratchpad for a procedure seems volatile –
it seems to disappear every time we switch procedures – a procedure’s values are therefore backed up in memory on a stack High address Proc A’s values Proc A call Proc B … call Proc C return Proc B’s values Proc C’s values … Stack grows this way Low address

Storage Management on a Call/Return
A new procedure must create space for all its variables on the stack Before executing the jal, the caller must save relevant values in $s0-$s7, $a0-$a3, $ra, temps into its own stack space Arguments are copied into $a0-$a3; the jal is executed After the callee creates stack space, it updates the value of $sp Once the callee finishes, it copies the return value into $v0, frees up stack space, and $sp is incremented On return, the caller may bring in its stack values, ra, temps into registers The responsibility for copies between stack and registers may fall upon either the caller or the callee

Example 1 int leaf_example (int g, int h, int i, int j) { int f ;
f = (g + h) – (i + j); return f; }

Example 1 int leaf_example (int g, int h, int i, int j) leaf_example:
{ int f ; f = (g + h) – (i + j); return f; } leaf_example: addi $sp, $sp, -12 sw $t1, 8($sp) sw $t0, 4($sp) sw $s0, 0($sp) add $t0, $a0, $a1 add $t1, $a2, $a3 sub $s0, $t0, $t1 add $v0, $s0, $zero lw $s0, 0($sp) lw $t0, 4($sp) lw $t1, 8($sp) addi $sp, $sp, 12 jr $ra Notes: In this example, the procedure’s stack space was used for the caller’s variables, not the callee’s – the compiler decided that was better. The caller took care of saving its $ra and $a0-$a3.

Example 2 int fact (int n) { if (n < 1) return (1);
else return (n * fact(n-1)); }

else return (n * fact(n-1)); } fact: addi $sp, $sp, -8 sw $ra, 4($sp) sw $a0, 0($sp) slti $t0, $a0, 1 beq $t0, $zero, L1 addi $v0, $zero, 1 addi $sp, $sp, 8 jr $ra L1: addi $a0, $a0, -1 jal fact lw $a0, 0($sp) lw $ra, 4($sp) addi $sp, $sp, 8 mul $v0, $a0, $v0 jr $ra Notes: The caller saves $a0 and $ra in its stack space. Temps are never saved.

Memory Organization The space allocated on stack by a procedure is termed the activation record (includes saved values and data local to the procedure) – frame pointer points to the start of the record and stack pointer points to the end – variable addresses are specified relative to $fp as $sp may change during the execution of the procedure $gp points to area in memory that saves global variables Dynamically allocated storage (with malloc()) is placed on the heap Stack Dynamic data (heap) Static data (globals) Text (instructions)

Lecture 4: Procedure Calls
Large constants The compilation process

Recap The jal instruction is used to jump to the procedure and
save the current PC (+4) into the return address register Arguments are passed in $a0-$a3; return values in $v0-$v1 Since the callee may over-write the caller’s registers, relevant values may have to be copied into memory Each procedure may also require memory space for local variables – a stack is used to organize the memory needs for each procedure

… The Stack The register scratchpad for a procedure seems volatile –
it seems to disappear every time we switch procedures – a procedure’s values are therefore backed up in memory on a stack High address Proc A’s values Proc A call Proc B … call Proc C return Proc B’s values Proc C’s values … Stack grows this way Low address

Example 1 int leaf_example (int g, int h, int i, int j) { int f ;
f = (g + h) – (i + j); return f; }

Example 1 int leaf_example (int g, int h, int i, int j) leaf_example:
{ int f ; f = (g + h) – (i + j); return f; } leaf_example: addi $sp, $sp, -12 sw $t1, 8($sp) sw $t0, 4($sp) sw $s0, 0($sp) add $t0, $a0, $a1 add $t1, $a2, $a3 sub $s0, $t0, $t1 add $v0, $s0, $zero lw $s0, 0($sp) lw $t0, 4($sp) lw $t1, 8($sp) addi $sp, $sp, 12 jr $ra Notes: In this example, the procedure’s stack space was used for the caller’s variables, not the callee’s – the compiler decided that was better. The caller took care of saving its $ra and $a0-$a3.

else return (n * fact(n-1)); }

else return (n * fact(n-1)); } fact: addi $sp, $sp, -8 sw $ra, 4($sp) sw $a0, 0($sp) slti $t0, $a0, 1 beq $t0, $zero, L1 addi $v0, $zero, 1 addi $sp, $sp, 8 jr $ra L1: addi $a0, $a0, -1 jal fact lw $a0, 0($sp) lw $ra, 4($sp) addi $sp, $sp, 8 mul $v0, $a0, $v0 jr $ra Notes: The caller saves $a0 and $ra in its stack space. Temps are never saved.

Memory Organization The space allocated on stack by a procedure is termed the activation record (includes saved values and data local to the procedure) – frame pointer points to the start of the record and stack pointer points to the end – variable addresses are specified relative to $fp as $sp may change during the execution of the procedure $gp points to area in memory that saves global variables Dynamically allocated storage (with malloc()) is placed on the heap Stack Dynamic data (heap) Static data (globals) Text (instructions)

Dealing with Characters
Instructions are also provided to deal with byte-sized and half-word quantities: lb (load-byte), sb, lh, sh These data types are most useful when dealing with characters, pixel values, etc. C employs ASCII formats to represent characters – each character is represented with 8 bits and a string ends in the null character (corresponding to the 8-bit number 0)

Example Convert to assembly: void strcpy (char x[], char y[]) { int i;
while ((x[i] = y[i]) != `\0’) i += 1; }

Example Convert to assembly: strcpy: void strcpy (char x[], char y[])
{ int i; i=0; while ((x[i] = y[i]) != `\0’) i += 1; } strcpy: addi $sp, $sp, -4 sw $s0, 0($sp) add $s0, $zero, $zero L1: add $t1, $s0, $a1 lb $t2, 0($t1) add $t3, $s0, $a0 sb $t2, 0($t3) beq $t2, $zero, L2 addi $s0, $s0, 1 j L1 L2: lw $s0, 0($sp) addi $sp, $sp, 4 jr $ra

Large Constants Immediate instructions can only specify 16-bit constants The lui instruction is used to store a 16-bit constant into the upper 16 bits of a register… thus, two immediate instructions are used to specify a 32-bit constant The destination PC-address in a conditional branch is specified as a 16-bit constant, relative to the current PC A jump (j) instruction can specify a 26-bit constant; if more bits are required, the jump-register (jr) instruction is used

Starting a Program x.c Compiler x.s Assembler x.a, x.so x.o Linker
C Program x.c Compiler Assembly language program x.s Assembler x.a, x.so x.o Object: machine language module Object: library routine (machine language) Linker Executable: machine language program a.out Loader Memory

Role of Assembler Convert pseudo-instructions into actual hardware
instructions – pseudo-instrs make it easier to program in assembly – examples: “move”, “blt”, 32-bit immediate operands, etc. Convert assembly instrs into machine instrs – a separate object file (x.o) is created for each C file (x.c) – compute the actual values for instruction labels – maintain info on external references and debugging information

Role of Linker Stitches different object files into a single executable patch internal and external references determine addresses of data and instruction labels organize code and data modules in memory Some libraries (DLLs) are dynamically linked – the executable points to dummy routines – these dummy routines call the dynamic linker-loader so they can update the executable to jump to the correct routine

Full Example – Sort in C void sort (int v[], int n) { int i, j;
for (i=0; i<n; i+=1) { for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) { swap (v,j); } void swap (int v[], int k) { int temp; temp = v[k]; v[k] = v[k+1]; v[k+1] = temp; } Allocate registers to program variables Produce code for the program body Preserve registers across procedure invocations

The swap Procedure Register allocation: $a0 and $a1 for the two arguments, $t0 for the temp variable – no need for saves and restores as we’re not using $s0-$s7 and this is a leaf procedure (won’t need to re-use $a0 and $a1) swap: sll $t1, $a1, 2 add $t1, $a0, $t1 lw $t0, 0($t1) lw $t2, 4($t1) sw $t2, 0($t1) sw $t0, 4($t1) jr $ra

The sort Procedure Register allocation: arguments v and n use $a0 and $a1, i and j use $s0 and $s1; must save $a0 and $a1 before calling the leaf procedure The outer for loop looks like this: (note the use of pseudo-instrs) move $s0, $zero # initialize the loop loopbody1: bge $s0, $a1, exit1 # will eventually use slt and beq … body of inner loop … addi $s0, $s0, 1 j loopbody1 exit1: for (i=0; i<n; i+=1) { for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) { swap (v,j); }

The sort Procedure The inner for loop looks like this:
addi $s1, $s0, # initialize the loop loopbody2: blt $s1, $zero, exit2 # will eventually use slt and beq sll $t1, $s1, 2 add $t2, $a0, $t1 lw $t3, 0($t2) lw $t4, 4($t2) bgt $t3, $t4, exit2 … body of inner loop … addi $s1, $s1, -1 j loopbody2 exit2: for (i=0; i<n; i+=1) { for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) { swap (v,j); }

Saves and Restores Since we repeatedly call “swap” with $a0 and $a1, we begin “sort” by copying its arguments into $s2 and $s3 – must update the rest of the code in “sort” to use $s2 and $s3 instead of $a0 and $a1 Must save $ra at the start of “sort” because it will get over-written when we call “swap” Must also save $s0-$s3 so we don’t overwrite something that belongs to the procedure that called “sort”

Saves and Restores sort: addi $sp, $sp, -20 sw $ra, 16($sp)
sw $s3, 12($sp) sw $s2, 8($sp) sw $s1, 4($sp) sw $s0, 0($sp) move $s2, $a0 move $s3, $a1 … move $a0, $s # the inner loop body starts here move $a1, $s1 jal swap exit1: lw $s0, 0($sp) addi $sp, $sp, 20 jr $ra 9 lines of C code  35 lines of assembly

Relative Performance Gcc optimization Relative Cycles Instruction CPI
performance count none B B O B B O B B O B B A Java interpreter has relative performance of 0.12, while the Jave just-in-time compiler has relative performance of 2.13 Note that the quicksort algorithm is about three orders of magnitude faster than the bubble sort algorithm (for 100K elements)

Lecture 5: MIPS Examples
Today’s topics: the compilation process full example – sort in C Reminder: 2nd assignment will be posted later today