CSE 502: Computer Architecture


1 CSE 502: Computer Architecture
Review

2 Course Overview (1/2) Caveat 1: I’m (kind of) new here.
Caveat 2: This is a (somewhat) new course. Computer Architecture is … the science and art of selecting and interconnecting hardware and software components to create computers …

3 Course Overview (2/2) This course is hard, roughly like CSE 506
In CSE 506, you learn what’s inside an OS In CSE 502, you learn what’s inside a CPU This is a project course Learn why things are the way they are, first hand We will “build” emulators of CPU components

4 Policy and Projects Probably different from other classes
Much more open, but much more strict Most people followed the policy Some did not Resembles the “real world” You’re here because you want to learn and to be here If you managed to get your partner(s) to do the work You’re probably good enough to do it at your job too The good: You might make a good manager The bad: You didn’t learn much Time mgmt. often more important than tech. skill If you started early, you probably have an A already

5 Amdahl’s Law
Speedup = time_without_enhancement / time_with_enhancement. An enhancement speeds up fraction f of a task by factor S: time_new = time_orig · ((1 − f) + f/S), so S_overall = 1 / ((1 − f) + f/S). (Figure: time_orig split into its (1 − f) and f portions; in time_new the f portion shrinks to f/S.) Real life analogy: after driving through 60 minutes of traffic jam, how much time can you make up by speeding in the final mile? Applications in Computer Architecture: RISC (Reduced Instruction Set Computer) is optimized to execute frequently used instructions quickly; infrequently used instructions can take a long time, or even be emulated in software. We should concentrate efforts on improving frequently occurring events or frequently used mechanisms.
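To make the formula concrete, here is a minimal C sketch (not from the slides) that evaluates S_overall = 1 / ((1 − f) + f/S) for a couple of illustrative points.

```c
#include <stdio.h>

/* Amdahl's Law: f = fraction of the task sped up, S = speedup of that fraction. */
static double amdahl(double f, double S) {
    return 1.0 / ((1.0 - f) + f / S);
}

int main(void) {
    /* Speeding up 90% of the work by 10x only yields about 5.3x overall. */
    printf("f=0.90, S=10  -> overall speedup %.2f\n", amdahl(0.90, 10.0));
    /* The final mile after a 60-minute jam: even a 100x speedup barely helps. */
    printf("f=0.02, S=100 -> overall speedup %.2f\n", amdahl(0.02, 100.0));
    return 0;
}
```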

6 The Iron Law of Processor Performance
Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle). The total work in the program (instruction count) is set by algorithms, compilers, and ISA extensions; CPI (or 1/IPC) by the microarchitecture; and 1/f (the clock period) by the microarchitecture and process technology. Architects target CPI, but must understand the others.
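As a quick sanity check, a minimal C sketch (the numbers are made up) that evaluates the iron law, time = instructions × CPI × clock period:

```c
#include <stdio.h>

int main(void) {
    double insns = 1e9;   /* total work in the program  */
    double cpi   = 1.5;   /* cycles per instruction     */
    double f_hz  = 2e9;   /* clock frequency = 1/period */
    double secs  = insns * cpi / f_hz;
    printf("execution time = %.3f s\n", secs);   /* prints 0.750 s */
    return 0;
}
```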

7 Averaging Performance Numbers (2/2)
Arithmetic mean: for times (proportional to time), e.g., latency. Harmonic mean: for rates (inversely proportional to time), e.g., throughput. Geometric mean: for ratios (unit-less quantities), e.g., speedups.
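A minimal C sketch (with made-up sample values) of the three means named above, for reference:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double x[] = {2.0, 4.0, 8.0};        /* e.g., three per-benchmark numbers */
    int n = 3;
    double arith = 0.0, harm = 0.0, geo = 1.0;
    for (int i = 0; i < n; i++) {
        arith += x[i];                   /* sum for arithmetic mean   */
        harm  += 1.0 / x[i];             /* sum of inverses for harmonic */
        geo   *= x[i];                   /* product for geometric     */
    }
    arith /= n;
    harm = n / harm;
    geo  = pow(geo, 1.0 / n);
    printf("arithmetic=%.3f harmonic=%.3f geometric=%.3f\n", arith, harm, geo);
    return 0;   /* 4.667, 3.429, 4.000 */
}
```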

8 Power vs. Energy Power: instantaneous rate of energy transfer
Expressed in Watts In Architecture, implies conversion of electricity to heat Power(Comp1+Comp2)=Power(Comp1)+Power(Comp2) Energy: measure of using power for some time Expressed in Joules power * time (joules = watts * seconds) Energy(OP1+OP2)=Energy(OP1)+Energy(OP2) What uses power in a chip?

9 ISA: A contract between HW and SW
ISA: Instruction Set Architecture A well-defined hardware/software interface The “contract” between software and hardware Functional definition of operations supported by hardware Precise description of how to invoke all features No guarantees regarding How operations are implemented Which operations are fast and which are slow (and when) Which operations take more energy (and which take less)

10 ISAs last forever, don’t add stuff you don’t need
Components of an ISA: programmer-visible state (program counter, general purpose registers, memory, control registers); programmer-visible behaviors (what to do, when to do it); and a binary encoding. Example "register-transfer-level" description of an instruction: if imem[pc]=="add rd, rs, rt" then gpr[rd]=gpr[rs]+gpr[rt]; pc ← pc+1. ISAs last forever, don't add stuff you don't need.
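Below is a minimal C sketch of an emulator step for the register-transfer-level description above; the struct layout, memory sizes, and opcode encoding are illustrative assumptions, not any real ISA's.

```c
#include <stdint.h>
#include <stdio.h>

enum { OP_ADD = 0 };

typedef struct { uint8_t op, rd, rs, rt; } Insn;

typedef struct {
    uint64_t pc;
    uint64_t gpr[32];      /* programmer-visible general purpose registers */
    Insn     imem[1024];   /* instruction memory, one Insn per pc          */
} Cpu;

/* One step of "if imem[pc]==add rd,rs,rt then gpr[rd]=gpr[rs]+gpr[rt]; pc=pc+1" */
static void step(Cpu *c) {
    Insn i = c->imem[c->pc];
    if (i.op == OP_ADD) {
        c->gpr[i.rd] = c->gpr[i.rs] + c->gpr[i.rt];
        c->pc += 1;
    }
}

int main(void) {
    Cpu c = {0};
    c.gpr[1] = 2; c.gpr[2] = 3;
    c.imem[0] = (Insn){ OP_ADD, 3, 1, 2 };   /* add r3, r1, r2 */
    step(&c);
    printf("gpr[3]=%llu pc=%llu\n",
           (unsigned long long)c.gpr[3], (unsigned long long)c.pc);
    return 0;
}
```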

11 Locality Principle Recent past is a good indication of near future
Spatial Locality: If you looked something up, it is very likely you will look up something nearby soon Temporal Locality: If you looked something up, it is very likely that you will look it up again soon Real life analogy: spatial locality - where you choose to sit in a room temporal locality - will you be here again next week? Examples in computer architecture: Execution of program loops spatial locality - after you execute an instruction, with very good probability, you will execute the next instruction temporal locality - you are very likely to repeat the same instructions many times

12 Caches An automatically managed hierarchy
Break memory into blocks (several bytes) and transfer data to/from the cache in blocks → exploits spatial locality. Keep recently accessed blocks → exploits temporal locality. (Figure: Core ↔ $ ↔ Memory.)

13 Fully-Associative Cache
Keep blocks in cache frames: each frame holds data, state (e.g., valid), and an address tag. The address splits into tag[63:6] and block offset[5:0]; every frame's tag is compared in parallel, and a multiplexor selects the hitting frame's data. What happens when the cache runs out of space?

14 The 3 C’s of Cache Misses Compulsory: Never accessed before
Capacity: Accessed long ago and already replaced Conflict: Neither compulsory nor capacity Coherence: (In multi-cores, become owner to write)

15 Cache Size Cache size is data capacity (don’t count tag and state)
Bigger can exploit temporal locality better, but bigger is not always better. Too large a cache: smaller is faster → bigger is slower, and access time may hurt the critical path. Too small a cache: limited temporal locality, and useful data is constantly replaced. (Plot: hit rate vs. capacity, leveling off near the working set size.)

16 Block Size
Block size is the data associated with an address tag; it is not necessarily the unit of transfer between hierarchies. Too small a block: doesn't exploit spatial locality well, and excessive tag overhead. Too large a block: useless data transferred, and too few total blocks, so useful data is frequently replaced. (Plot: hit rate vs. block size, rising then falling.)

17 Direct-Mapped Cache Use middle bits as index Only one tag comparison
The address splits into tag, index, and block offset[5:0]; a decoder selects one frame by the index bits, and a single tag comparison determines the hit. (Figure: decoder indexes one {state, tag, data} entry; multiplexor selects the data; tag match → hit?)
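As a concrete illustration of the offset/index/tag split, a minimal C sketch; the 64-byte block and 512-set geometry are assumptions chosen to match the bit ranges used on these slides.

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 6    /* 64-byte blocks: offset = addr[5:0]   */
#define INDEX_BITS 9    /* 512 sets:       index  = addr[14:6]  */

int main(void) {
    uint64_t addr   = 0xcfff9824ULL;
    uint64_t offset = addr & ((1ULL << BLOCK_BITS) - 1);
    uint64_t index  = (addr >> BLOCK_BITS) & ((1ULL << INDEX_BITS) - 1);
    uint64_t tag    = addr >> (BLOCK_BITS + INDEX_BITS);   /* addr[63:15] */
    printf("offset=%llu index=%llu tag=0x%llx\n",
           (unsigned long long)offset, (unsigned long long)index,
           (unsigned long long)tag);
    return 0;
}
```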

18 N-Way Set-Associative Cache
The address splits into tag[63:15], index[14:6], and block offset[5:0]; the index selects a set, all ways in the set are compared in parallel, and a multiplexor picks the hitting way. Note the additional bit(s) moved from index to tag.

19 Associativity
Larger associativity: lower miss rate (fewer conflicts), but higher power consumption. Smaller associativity: lower cost and faster hit time. (Plot: hit rate vs. associativity, with ~5 marked for L1-D.)

20 Parallel vs Serial Caches
Tag and Data arrays are usually separate (the tag array is smaller and faster). State bits are stored along with the tags: valid bit, "LRU" bit(s), and so on. Parallel access to Tag and Data reduces latency (good for L1); serial access to Tag and Data reduces power (good for L2+). (Figure: parallel tag+data arrays vs. the tag comparison enabling the data array.)

21 Physically-Indexed Caches
Core requests are virtual addresses (VAs). The cache index is PA[15:6], so the VA passes through the TLB first: the D-TLB is on the critical path. The cache tag is PA[63:16]. If the index size < page size, the VA can be used for the index. (Figure: virtual page[63:13] + page offset[12:0] → D-TLB → physical index and physical tag compared against the cache arrays.)

22 Virtually-Indexed Caches
Core requests are VAs. The cache index is VA[15:6] and the cache tag is PA[63:16]. Why not tag with the VA? It would require a cache flush on context switch and create virtual aliases; either ensure aliases cannot exist, or check all candidates on a miss. (Figure: the virtual index overlaps the virtual page number by one bit; the D-TLB supplies the physical tag for comparison.)

23 Inclusion Core often accesses blocks not present on chip
Should the block be allocated in L3, L2, and L1? Allocating in all levels is called inclusive caching: it wastes space and requires forced eviction (e.g., force evict from L1 on evict from L2+). Allocating blocks only in L1 is called non-inclusive caching (why not "exclusive"?): clean lines must be written back. Some processors combine both: L3 is inclusive of L1 and L2, while L2 is non-inclusive of L1 (like a large victim cache).

24 Parity & ECC Cosmic radiation can strike at any time What can be done?
Especially at high altitude Or during solar flares What can be done? Parity 1 bit to indicate if sum is odd/even (detects single-bit errors) Error Correcting Codes (ECC) 8 bit code per 64-bit word Generally SECDED (Single-Error-Correct, Double-Error-Detect) Detecting errors on clean cache lines is harmless Pretend it’s a cache miss

25 SRAM vs. DRAM
SRAM = Static RAM: as long as power is present, data is retained; 6T per bit; built with normal high-speed CMOS technology. DRAM = Dynamic RAM: if you don't do anything, you lose the data; 1T per bit (+1 capacitor); built with a special DRAM process optimized for density. Again, this should be review for ECE students; CS students may not have seen this type of stuff.

26 DRAM Chip Organization
Low-Level organization is very similar to SRAM Cells are only single-ended Reads destructive: contents are erased by reading Row buffer holds read data Data in row buffer is called a DRAM row Often called “page” - not necessarily same as OS page Read gets entire row into the buffer Block reads always performed out of the row buffer Reading a whole row, but accessing one block Similar to reading a cache line, but accessing one word

27 DRAM Organization
(Figure: a dual-rank x8 (2Rx8) DIMM built from multiple DRAM chips and banks.) All banks within a rank share all address and control pins. All banks are independent, but the channel can only talk to one bank at a time. "x8" means each DRAM chip outputs 8 bits, so 8 chips are needed for DDRx's 64-bit data bus. Why 9 chips per rank? 64 bits of data plus 8 bits of ECC.

28 AMAT with MLP
If a cache hit is 10 cycles (core to L1 and back) and a memory access is 100 cycles (core to memory and back), then at a 50% miss ratio the average access is 0.5×10 + 0.5×100 = 55 cycles. Unless MLP is > 1.0: at a 50% miss ratio and 1.5 MLP, the average access is (0.5×10 + 0.5×100)/1.5 ≈ 37; at 4.0 MLP it is ≈ 14. In many cases, MLP dictates performance.
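A minimal C sketch of the slide's arithmetic, treating MLP as a divisor of the whole average access time exactly as the slide does:

```c
#include <stdio.h>

static double amat(double miss_rate, double hit_lat, double miss_lat, double mlp) {
    return ((1.0 - miss_rate) * hit_lat + miss_rate * miss_lat) / mlp;
}

int main(void) {
    printf("MLP 1.0: %.0f cycles\n", amat(0.5, 10, 100, 1.0));  /* 55 */
    printf("MLP 1.5: %.0f cycles\n", amat(0.5, 10, 100, 1.5));  /* ~37 */
    printf("MLP 4.0: %.0f cycles\n", amat(0.5, 10, 100, 4.0));  /* ~14 */
    return 0;
}
```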

29 Memory Controller (1/2)
(Block diagram: a read queue, write buffer, and response queue hold commands and data to/from the CPU; a scheduler issues DRAM commands on Channel 0 and Channel 1.)

30 Memory Controller (2/2) Memory controller connects CPU and DRAM
Receives requests after cache misses in LLC Possibly originating from multiple cores Complicated piece of hardware, handles: DRAM Refresh Row-Buffer Management Policies Address Mapping Schemes Request Scheduling

31 Address Mapping Schemes
Example Open-page Mapping Scheme: High Parallelism: [row rank bank column channel offset] Easy Expandability: [channel rank row bank column offset] Example Close-page Mapping Scheme: High Parallelism: [row column rank bank channel offset] Easy Expandability: [channel rank row column bank offset]
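To make one of these mappings concrete, here is a minimal C sketch that decodes a physical address under the open-page "easy expandability" ordering [channel rank row bank column offset]; all field widths are assumptions for illustration, not real hardware parameters.

```c
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6
#define COL_BITS    10
#define BANK_BITS   3
#define ROW_BITS    15
#define RANK_BITS   1

int main(void) {
    uint64_t a = 0x123456789ULL;
    /* Peel fields from least significant to most significant. */
    uint64_t offset = a & ((1ULL << OFFSET_BITS) - 1);  a >>= OFFSET_BITS;
    uint64_t col    = a & ((1ULL << COL_BITS)    - 1);  a >>= COL_BITS;
    uint64_t bank   = a & ((1ULL << BANK_BITS)   - 1);  a >>= BANK_BITS;
    uint64_t row    = a & ((1ULL << ROW_BITS)    - 1);  a >>= ROW_BITS;
    uint64_t rank   = a & ((1ULL << RANK_BITS)   - 1);  a >>= RANK_BITS;
    uint64_t chan   = a;                                 /* remaining bits */
    printf("ch=%llu rank=%llu row=%llu bank=%llu col=%llu off=%llu\n",
           (unsigned long long)chan, (unsigned long long)rank,
           (unsigned long long)row,  (unsigned long long)bank,
           (unsigned long long)col,  (unsigned long long)offset);
    return 0;
}
```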

32 Memory Request Scheduling
Write buffering Writes can wait until reads are done Queue DRAM commands Usually into per-bank queues Allows easily reordering ops. meant for same bank Common policies: First-Come-First-Served (FCFS) First-Ready—First-Come-First-Served (FR-FCFS)

33 Prefetching (1/2) Fetch block ahead of demand
Target compulsory, capacity, (and coherence) misses, but not conflict misses, since a prefetched block could itself conflict. Big challenges: knowing "what" to fetch (fetching useless blocks wastes resources) and knowing "when" to fetch (too early → clutters storage, or the block gets thrown out before use; too late → defeats the purpose of "pre"-fetching).

34 Prefetching (2/2) Prefetching must be accurate and timely
(Timeline figure: without prefetching, the load pays the full L1 → L2 → DRAM latency; with an early enough prefetch, the data is already back when the load arrives and the load-to-use latency is much improved; a late prefetch only somewhat improves the latency.) Prefetching must be accurate and timely.

35 Next-Line (or Adjacent-Line) Prefetching
On a request for line X, prefetch X+1 (or X^0x1). Assumes spatial locality, which is often a good assumption. Should stop at physical (OS) page boundaries. Can often be done efficiently: adjacent-line is convenient when the next-level block is bigger, and prefetches from DRAM can use bursts and row-buffer hits. Works for both I$ and D$: instructions execute sequentially, and large data structures often span multiple blocks. Crossing page boundaries can cause issues: first, the page may not be mapped, and you probably don't want to take a page fault for a prefetch you don't even know will be useful; second, the next physically contiguous page may have nothing to do with where the next virtual page is physically located. Simple, but usually not timely.

36 Next-N-Line Prefetching
On a request for line X, prefetch X+1, X+2, …, X+N. N is called the "prefetch depth" or "prefetch degree", and it must be carefully tuned. Large N is more likely to be useful (correct and timely), but it is more aggressive → more likely to make a mistake (might evict something useful), and more expensive → needs storage for prefetched lines and might delay a useful request on the interconnect or a port. Still simple, but more timely than Next-Line.

37 Stride Prefetching
Access patterns often follow a stride: accessing a column of elements in a matrix, or accessing elements in an array of structs. Detect stride S and prefetch to depth N: prefetch X+1·S, X+2·S, …, X+N·S.

38 “Localized” Stride Prefetchers
Store the PC, last address, last stride, and a count in the RPT (Reference Prediction Table). On an access, check the RPT: same stride? count++ if yes, count-- (or count=0) if no. If confident about the stride (count > Cmin), prefetch (last address + stride×N). (Table example: three load/store PCs around 0x409A3x, each tracking its own last address, stride N, and count.)
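A minimal C sketch of a localized (PC-indexed) stride prefetcher in the spirit of the RPT above; the table size, hash, and confidence threshold are assumptions, and issue_prefetch() is a hypothetical hook standing in for the real prefetch path.

```c
#include <stdint.h>
#include <stdio.h>

#define RPT_SIZE 256
#define CONF_MIN 2

typedef struct {
    uint64_t tag;        /* load/store PC                    */
    uint64_t last_addr;  /* last address touched by this PC  */
    int64_t  stride;     /* last observed stride             */
    int      count;      /* confidence counter               */
} RptEntry;

static RptEntry rpt[RPT_SIZE];

static void issue_prefetch(uint64_t addr) {          /* stand-in hook */
    printf("prefetch 0x%llx\n", (unsigned long long)addr);
}

/* Called on every load/store access made by the core. */
static void rpt_access(uint64_t pc, uint64_t addr, int depth) {
    RptEntry *e = &rpt[(pc >> 2) % RPT_SIZE];
    if (e->tag != pc) {                               /* new PC: reset entry */
        *e = (RptEntry){ pc, addr, 0, 0 };
        return;
    }
    int64_t stride = (int64_t)(addr - e->last_addr);
    if (stride == e->stride) {
        if (e->count < CONF_MIN) e->count++;          /* same stride: gain confidence */
    } else {
        e->stride = stride;                           /* new stride: restart training */
        e->count  = 0;
    }
    e->last_addr = addr;
    if (e->count >= CONF_MIN && stride != 0)
        for (int i = 1; i <= depth; i++)              /* prefetch A+S, A+2S, ..., A+NS */
            issue_prefetch(addr + (uint64_t)(stride * i));
}

int main(void) {
    for (uint64_t i = 0; i < 6; i++)                  /* sequential 64-byte strides */
        rpt_access(0x409A34, 0x1000 + i * 64, 2);
    return 0;
}
```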

39 Evaluating Prefetchers
Compare against larger caches: a complex prefetcher vs. a simple prefetcher with a larger cache. Primary metrics: Coverage = prefetched hits / base misses; Accuracy = prefetched hits / total prefetches; Timeliness = latency of prefetched blocks / hit latency. Secondary metrics: Pollution = misses / (prefetched hits + base misses); Bandwidth = (total prefetches + misses) / base misses; power, energy, area...

40 Before there was pipelining…
Single-cycle insn0.(fetch,decode,exec) insn1.(fetch,decode,exec) Multi-cycle insn0.fetch insn0.dec insn0.exec insn1.fetch insn1.dec insn1.exec time Single-cycle control: hardwired Low CPI (1) Long clock period (to accommodate slowest instruction) Multi-cycle control: micro-programmed Short clock period High CPI Can we have both low CPI and short clock period?

41 Pipelining
(Timeline figure: multi-cycle vs. pipelined execution of insn0, insn1, insn2 through fetch/decode/execute.) Start with the multi-cycle design: when insn0 goes from stage 1 to stage 2, insn1 starts stage 1. Each instruction passes through all stages, but instructions enter and leave at a faster rate. Can have as many insns in flight as there are stages.

42 Instruction Dependencies
Data Dependence Read-After-Write (RAW) (only true dependence) Read must wait until earlier write finishes Anti-Dependence (WAR) Write must wait until earlier read finishes (avoid clobbering) Output Dependence (WAW) Earlier write can’t overwrite later write Control Dependence (a.k.a. Procedural Dependence) Branch condition must execute before branch target Instructions after branch cannot run before branch

43 Pipeline Terminology
Pipeline Hazards: potential violations of program dependencies; must ensure program dependencies are not violated. Hazard Resolution: static method, performed at compile time in software, or dynamic method, performed at runtime using hardware; two options: stall (costs perf.) or forward (costs hw.). Pipeline Interlock: hardware mechanism for dynamic hazard resolution; must detect and enforce dependencies at runtime.

44 Simple 5-stage Pipeline
(Datapath figure: PC, instruction cache, register file, ALU, and data cache, with pipeline latches IF/ID, ID/EX, EX/Mem, and Mem/WB between the stages.)

45 Balancing Pipeline Stages
Coarser-grained machine cycle: 4 machine cycles per instruction (IF, ID, OF, EX, WB folded into # stages = 4, T_cyc = 9 units). Finer-grained machine cycle: 11 machine cycles per instruction (# stages = 11, T_cyc = 3 units). Stage latencies: T_IF&ID = 8 units, T_OF = 9 units, T_EX = 5 units, T_OS = 9 units.

46 IPC vs. Frequency 10-15% IPC not bad if frequency can double
Frequency doesn't actually double, though: latch/pipeline overhead and stage imbalance get in the way. (Figure: 2.0 IPC at 1 GHz = 2 BIPS vs. 1.7 IPC at 2 GHz = 3.4 BIPS; a 1000 ps stage splits into 500+500 ps ideally, into 450+450 ps of a 900 ps budget with overhead, or into an imbalanced 350/550 ps pair running at only 1.5 GHz.) Just pointing out that the ideal performance (double clock speed combined with a 10-15% IPC hit) is not likely achievable due to many other issues.

47 Architectures for Instruction Parallelism
Scalar pipeline (baseline): instruction/overlap parallelism = D, operation latency = 1, peak IPC = 1.0. (Figure: D different instructions overlapped across successive cycles.)

48 Superscalar Machine Superscalar (pipelined) Execution
Superscalar (pipelined) execution: instruction parallelism = D × N, operation latency = 1, peak IPC = N per cycle. (Figure: D × N different instructions overlapped, N issued per cycle.)

49 RISC ISA Format
Fixed-length: in MIPS all instructions are 32 bits / 4 bytes. Few formats: MIPS has 3: R (reg, reg, reg), I (reg, reg, imm), J (addr); Alpha has 5: Operate, Operate w/ Imm, Memory, Branch, FP. Regularity across formats (when possible/practical): MIPS and Alpha put the opcode in the same bit-position for all formats; the MIPS rs and rt fields are in the same bit-position for R and I formats; the Alpha ra/fa field is in the same bit-position for all 5 formats.

50 Superscalar Decode for RISC ISAs
Decode X instructions per cycle (e.g., 4-wide): just duplicate the hardware, since instructions are aligned at 32-bit boundaries. (Figure: a scalar decoder handling one 32-bit instruction vs. a 4-wide superscalar fetch feeding four decoders.)

51 CISC ISA RISC focus on fast access to information
Easy decode, I$, large RF’s, D$ CISC focus on max expressiveness per min space Designed in era with fewer transistors, chips Each memory access very expensive Pack as much work into as few bytes as possible More “expressive” instructions Better potential code generation in theory More complex code generation in practice

52 ADD in RISC ISA Mode: Register. Example: ADD R4, R3, R2. Meaning: R4 = R3 + R2.

53 ADD in CISC ISA
Register:          ADD R4, R3      => R4 = R4 + R3
Immediate:         ADD R4, #3      => R4 = R4 + 3
Displacement:      ADD R4, 100(R1) => R4 = R4 + Mem[100+R1]
Register Indirect: ADD R4, (R1)    => R4 = R4 + Mem[R1]
Indexed/Base:      ADD R3, (R1+R2) => R3 = R3 + Mem[R1+R2]
Direct/Absolute:   ADD R1, (1234)  => R1 = R1 + Mem[1234]
Memory Indirect:   ADD R1, …       => R1 = R1 + Mem[Mem[R3]]
Auto-Increment:    ADD R1, (R2)+   => R1 = R1 + Mem[R2]; R2++
Auto-Decrement:    ADD R1, -(R2)   => R2--; R1 = R1 + Mem[R2]

54 RISC (MIPS) vs CISC (x86) MIPS: lui R1, Disp[31:16]; ori R1, R1, Disp[15:0]; add R1, R1, R2; shli R3, R3, 3; add R3, R3, R1; lui R1, Imm[31:16]; ori R1, R1, Imm[15:0]; st [R3], R1. x86: MOV [EBX+EAX*8+Disp], Imm. That is 8 insns. at 32 bits each vs 1 insn. at 88 bits: 2.9x!

55 Instruction length not known until after decode
x86 Encoding Basic x86 Instruction: Prefixes 0-4 bytes Opcode 1-2 bytes Mod R/M 0-1 bytes SIB 0-1 bytes Displacement 0/1/2/4 bytes Immediate 0/1/2/4 bytes Longest Inst 15 bytes Shortest Inst: 1 byte Opcode has flag indicating Mod R/M is present Most instructions use the Mod R/M byte Mod R/M specifies if optional SIB byte is used Mod R/M and SIB may specify additional constants Instruction length not known until after decode

56 Instruction Cache Organization
To fetch N instructions per cycle, the L1-I line must be wide enough for N instructions, and the PC register selects the L1-I line. A fetch group is the set of instructions starting at the PC: for an N-wide machine, [PC, PC+N-1]. (Figure: the PC indexes a cache line of tagged instruction groups feeding the decoder.)

57 Fetch Misalignment Now takes two cycles to fetch N instructions
(Figure: the fetch group starting at PC xxx01001 straddles a cache line, so the instructions are fetched over two cycles, continuing at PC xxx01100 in cycle 2.) "Reduction may not be as bad as a full halving": just because you fetched only K < N instructions during cycle 1 does not limit you to fetching only N-K instructions in cycle 2.

58 Fragmentation due to Branches
The fetch group is aligned and the cache line size > fetch group, but taken branches still limit fetch width. (Figure: a taken branch in the middle of the fetch group means the slots after it go unused.)

59 Need direction and target to find next fetch group
Types of Branches. Direction: conditional vs. unconditional. Target: PC-encoded (PC-relative or absolute offset) vs. computed (target derived from a register). Need both direction and target to find the next fetch group.

60 Branch Prediction Overview
Use two hardware predictors Direction predictor guesses if branch is taken or not-taken Target predictor guesses the destination PC Predictions are based on history Use previous behavior as indication of future behavior Use historical context to disambiguate predictions This lecture does not discuss how to predict the direction of branches (T vs. NT)… see next lecture for that.

61 Direction vs. Target Prediction
Direction: 0 or 1. Target: a 32- or 64-bit value. It turns out targets are generally easier to predict: you don't need to predict the not-taken target, and a taken branch's target doesn't usually change, so you only need to predict taken-branch targets. The prediction is really just a "cache": the Branch Target Buffer (BTB). Be careful about whether you add "sizeof(inst)" or "sizeof(cacheline)" to the PC (and really it's the PC of the start of the cacheline if you're adding "sizeof(cacheline)").

62 Branch Target Buffer (BTB)
(Figure: the branch PC indexes the BTB; each entry holds a valid bit (V), the branch instruction address (BIA) used as the tag, and the branch target address (BTA). On a tag match (hit), the BTA becomes the next fetch PC.)

63 Fewer bits to compare, but prediction may alias
BTB w/Partial Tags v cfff981 cfff9704 cfff9810 v cfff982 cfff9830 cfff9824 v cfff984 cfff9900 cfff984c beef9810 cfff9810 cfff9824 cfff984c v f981 cfff9704 f982 cfff9830 f984 cfff9900 May lead to false hits, as shown by the red address. Fewer bits to compare, but prediction may alias

64 BTB w/PC-offset Encoding
(Figure: storing only the low target bits, e.g., ff9704 instead of cfff9704, and reusing the PC's upper bits to reconstruct the full target.) Branch targets are usually close by, so the upper bits of the target's address are usually identical to those of the branch's own PC. If the target is too far away or the PC rolls over, the BTB will mispredict.

65 Branches Have Locality
If a branch was previously taken… There’s a good chance it’ll be taken again for(i=0; i < ; i++) { /* do stuff */ } This branch will be taken 99,999 times in a row.

66 Last Outcome Predictor
Do what you did last time, i.e., a 1-bit counter. Example: 0xDC08: for(i=0; i < ; i++) { 0xDC44: if( (i % 100) == 0 ) tick(); 0xDC50: if( (i & 1) == 1) odd(); }

67 Saturating Two-Bit Counter
(Figure: the FSM for a 2bC (2-bit counter) has four states 0-3; states 0-1 predict not-taken and states 2-3 predict taken; a taken outcome moves toward 3, a not-taken outcome moves toward 0. The last-outcome predictor is the same idea with only two states.)

68 Typical Organization of 2bC Predictor
(Figure: the 32- or 64-bit PC is hashed down to log2(n) bits to index a table of n counters; the indexed counter provides the prediction, and FSM update logic writes the counter back once the actual outcome is known.)
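A minimal C sketch of this 2bC predictor organization; the table size and PC hash are assumptions.

```c
#include <stdint.h>
#include <stdio.h>

#define N_COUNTERS 4096                  /* must be a power of two */

static uint8_t ctr[N_COUNTERS];          /* 0,1 predict not-taken; 2,3 predict taken */

static int predict(uint64_t pc) {
    return ctr[(pc >> 2) & (N_COUNTERS - 1)] >= 2;
}

static void update(uint64_t pc, int taken) {
    uint8_t *c = &ctr[(pc >> 2) & (N_COUNTERS - 1)];
    if (taken) { if (*c < 3) (*c)++; }   /* saturate at 3 */
    else       { if (*c > 0) (*c)--; }   /* saturate at 0 */
}

int main(void) {
    uint64_t pc = 0xDC08;
    int correct = 0;
    for (int i = 0; i < 100000; i++) {   /* loop branch: taken except the last time */
        int taken = (i != 99999);
        correct += (predict(pc) == taken);
        update(pc, taken);
    }
    printf("accuracy = %.5f\n", correct / 100000.0);
    return 0;
}
```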

69 Track the History of Branches
Keep the previous outcome alongside the PC and use it to choose between two counters: one used when the previous outcome was 0 and one used when it was 1. (Animated example: the left circle is the 2bC used when the previous outcome was 0, the right when it was 1; the unused counter is shaded darker.)

70 Deeper History Covers More Patterns
Counters learn the "pattern" of prediction: with the previous 3 outcomes there is one counter per history value (prev=000, 001, …, 111). For the repeating pattern (0011)*, the learned mapping is 001 → 1, 011 → 0, 110 → 0, 100 → 1, …

71 Predictor Training Time
Example: the prediction equals the opposite of the 2nd most recent outcome. With history length 2 there are 4 states to train: NN → T, NT → T, TN → N, TT → N. With history length 3 there are 8 states to train: NNN → T, NNT → T, NTN → N, NTT → N, TNN → T, TNT → T, TTN → N, TTT → N.

72 Predictor Organizations
(Three organizations, each indexed by a PC hash: a different pattern history per branch, a shared set of patterns, or a mix of both.) Each trades off aliasing in different places: the first suffers from different static branches mapping into the same local history and counters; the second allows different static branches that exhibit the same local history to map into the same counters. The figures do not imply that the total number of branch history registers in the three organizations is necessarily the same.

73 Two-Level Predictor Organization
Branch History Table (BHT): 2^a entries, with an h-bit history per entry. Pattern History Table (PHT): 2^b sets with 2^h counters per set, each entry a 2-bit counter. Total size in bits: h·2^a + 2·2^(b+h). (Figure: the PC hash provides a bits to index the BHT and b bits to select the PHT set; the h-bit history selects the counter.)

74 Combined Indexing
"gshare" (S. McFarling): XOR a k-bit PC hash with the k-bit global history to index the counter table, where k = log2(number of counters).
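A minimal C sketch of gshare indexing and update; the history length, table size, and PC hash are assumptions.

```c
#include <stdint.h>
#include <stdio.h>

#define K 12
#define N_COUNTERS (1u << K)

static uint8_t  pht[N_COUNTERS];     /* 2-bit counters                 */
static uint32_t ghist;               /* last K global branch outcomes  */

static uint32_t gshare_index(uint64_t pc) {
    return ((uint32_t)(pc >> 2) ^ ghist) & (N_COUNTERS - 1);
}

static int gshare_predict(uint64_t pc) {
    return pht[gshare_index(pc)] >= 2;
}

static void gshare_update(uint64_t pc, int taken) {
    uint8_t *c = &pht[gshare_index(pc)];
    if (taken) { if (*c < 3) (*c)++; } else { if (*c > 0) (*c)--; }
    ghist = ((ghist << 1) | (taken & 1)) & (N_COUNTERS - 1);   /* shift in outcome */
}

int main(void) {
    for (int i = 0; i < 30; i++)     /* train a repeating T,T,N pattern */
        gshare_update(0x400, (i % 3) != 2);
    printf("prediction: %d\n", gshare_predict(0x400));
    return 0;
}
```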

75 OoO Execution Out-of-Order execution (OoO)
Totally in the hardware Also called Dynamic scheduling Fetch many instructions into instruction window Use branch prediction to speculate past branches Rename regs. to avoid false deps. (WAW and WAR) Execute insns. as soon as possible As soon as deps. (regs and memory) are known Today’s machines: 100+ insns. scheduling window

76 Superscalar != Out-of-Order
Code: A: R1 = Load 16[R2]; B: R3 = R1 + R4; C: R6 = Load 8[R9]; D: R5 = R2 - 4; E: R7 = Load 20[R5]; F: R4 = R4 - 1; G: BEQ R4, #0. With A's load missing in the cache: 1-wide in-order takes 10 cycles, 2-wide in-order 8 cycles, 1-wide out-of-order 7 cycles, and 2-wide out-of-order 5 cycles. Superscalar/in-order is not uncommon; out-of-order/single-issue is possible but not common.

77 Review of Register Dependencies
Read-After-Write: A: R1 = R2 + R3; B: R4 = R1 * R4 (B reads the R1 that A writes). Write-After-Read: A: R1 = R3 / R4; B: R3 = R2 * R4 (B overwrites the R3 that A reads). Write-After-Write: A: R1 = R2 + R3; B: R1 = R3 * R4 (both write R1). This should be review. Each is an example of how you will get the wrong results if you reorder two instructions that have a register dependency. (Figure: register-file contents showing the wrong values when A and B are swapped in each case.)

78 Register Renaming Register renaming (in hardware) How does it work?
“Change” register names to eliminate WAR/WAW hazards Arch. registers (r1,f0…) are names, not storage locations Can have more locations than names Can have multiple active versions of same name How does it work? Map-table: maps names to most recent locations On a write: allocate new location, note in map-table On a read: find location of most recent write via map-table
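A minimal C sketch of map-table renaming with an explicit free list; the register counts are assumptions, and freeing of overwritten physical registers (normally done at retire) is omitted to keep the sketch short.

```c
#include <stdio.h>

#define N_ARCH 32
#define N_PHYS 64

static int map_table[N_ARCH];    /* arch reg -> most recent physical location */
static int free_list[N_PHYS];
static int free_top;

static void rename_init(void) {
    for (int a = 0; a < N_ARCH; a++) map_table[a] = a;        /* identity map */
    free_top = 0;
    for (int p = N_PHYS - 1; p >= N_ARCH; p--) free_list[free_top++] = p;
}

/* Rename "rd = rs op rt": reads look up the map table, the write allocates a
 * fresh physical register, which removes WAR/WAW hazards on rd.            */
static void rename_insn(int rd, int rs, int rt) {
    int ps = map_table[rs], pt = map_table[rt];
    int pd = free_list[--free_top];                            /* new location */
    map_table[rd] = pd;                                        /* note in map  */
    printf("p%d = p%d op p%d\n", pd, ps, pt);
}

int main(void) {
    rename_init();
    rename_insn(1, 2, 3);   /* R1 = R2 op R3                            */
    rename_insn(4, 1, 1);   /* R4 = R1 op R1 (reads the new R1)         */
    rename_insn(1, 5, 6);   /* R1 = R5 op R6 (WAW on R1 is eliminated)  */
    return 0;
}
```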

79 Tomasulo’s Algorithm Reservation Stations (RS): instruction buffer
Common data bus (CDB): broadcasts results to RS Register renaming: removes WAR/WAW hazards Bypassing (not shown here to make example simpler)

80 Tomasulo Data Structures
(Figure: fetched instructions consult the map table and register file; each reservation station entry holds op, destination tag T, source tags T1/T2, and values V1/V2; the CDB broadcasts a tag and value (CDB.T, CDB.V) that are matched (==) against waiting entries before issue to the FU.)

81 Where is the “register rename”?
(Same structure as the previous slide.) The rename lives in the map table plus the value copies in the RS (V1, V2): an instruction stores correct input values in its own RS entry, and the "free list" is implicit (entries are allocated/deallocated as part of the RS).

82 Precise State Speculative execution requires
(Ability to) abort & restart at every branch Abort & restart at every load Synchronous (exception and trap) events require Abort & restart at every load, store, divide, … Asynchronous (hardware) interrupts require Abort & restart at every ?? Real world: bite the bullet Implement abort & restart at every insn. Called precise state

83 Complete and Retire
(Figure: the Re-Order Buffer (ROB) sits between the execution back end and the register file.) Complete (C): instructions write results into the ROB, out of order, so younger instructions are not blocked. Retire (R), a.k.a. commit or graduate: the ROB writes results to the register file, in order, so a stall back-propagates to younger instructions.

84 P6 Data Structures
(Block diagram: regfile, map table with T+/value entries, ROB with head and tail pointers (retire from the head, dispatch at the tail), reservation stations holding op, T, T1, T2, V1, V2, the CDB.T/CDB.V broadcast matched (==) against RS tags, and the FU.)

85 MIPS R10K: Alternative Implementation
(Block diagram: map table of physical-register tags, free list, architectural map, ROB entries holding T and Told, and reservation stations holding op, T, T1+, T2+, with the CDB.T broadcast matched against RS tags.) One big physical register file holds all data, with no copies; the register file is close to the FUs, giving a small and fast data path; the ROB and RS are "on the side", used only for control and tags.

86 Executing Memory Instructions
Basic example of address-based dependency ambiguity: Load R3 = 0[R6] issues and misses in the cache; Add R7 = R3 + R9 issues once the miss is serviced; Store R4 → 0[R7] issues; meanwhile Sub R1 = R1 - R2 and the later Load R8 = 0[R1] issue, and the load hits in the cache. If R1 != R7, the load of R8 correctly gets its value from the cache. If R1 == R7, the load of R8 should have gotten its value from the store, but it didn't!

87 Memory Disambiguation Problem
Ordering problem is a data-dependence violation Imprecise memory worse than imprecise registers Why can’t this happen with non-memory insts? Operand specifiers in non-memory insns. are absolute “R1” refers to one specific location Operand specifiers in memory insns. are ambiguous “R1” refers to a memory location specified by the value of R1. When pointers (e.g., R1) change, so does this location

88 Two Problems Memory disambiguation on loads
Do earlier unexecuted stores to the same address exist? Binary question: answer is yes or no Store-to-load forwarding problem I’m a load: Which earlier store do I get my value from? I’m a store: Which later load(s) do I forward my value to? Non-binary question: answer is one or more insn. identifiers

89 Load/Store Queue (1/2) Load/store queue (LSQ)
Completed stores write to LSQ When store retires, head of LSQ written to L1-D (or write buffer) When loads execute, access LSQ and L1-D in parallel Forward from LSQ if older store with matching address
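A minimal C sketch of the LSQ search a load performs; it assumes whole-word accesses and a simplistic array-based queue, and it leaves out retirement and the parallel L1-D access.

```c
#include <stdint.h>
#include <stdio.h>

#define LSQ_SIZE 16

typedef struct { int valid; uint64_t addr; uint64_t data; } StoreEntry;

static StoreEntry lsq[LSQ_SIZE];
static int lsq_tail;                   /* entries [0, lsq_tail) are in program order */

static void lsq_store(uint64_t addr, uint64_t data) {
    lsq[lsq_tail++] = (StoreEntry){1, addr, data};
}

/* Returns 1 and sets *data if an older store forwards; 0 means "use the cache". */
static int lsq_load(uint64_t addr, uint64_t *data) {
    for (int i = lsq_tail - 1; i >= 0; i--)          /* youngest older store wins */
        if (lsq[i].valid && lsq[i].addr == addr) { *data = lsq[i].data; return 1; }
    return 0;
}

int main(void) {
    uint64_t v = 0;
    lsq_store(0x4000, 111);
    lsq_store(0x4120, 222);
    lsq_store(0x4000, 333);             /* most recent store to 0x4000 */
    printf("load 0x4000 -> %s %llu\n",
           lsq_load(0x4000, &v) ? "forwarded" : "from cache",
           (unsigned long long)v);
    return 0;
}
```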

90 Load/Store Queue (2/2)
(Figure: fetch and branch prediction, ROB, regfile, LSQ, and L1-D connected by load/store address and data paths.) Almost a "real" processor diagram.

91 Loads Execute When … Most aggressive approach
Relies on the fact that store-to-load forwarding is rare. Greatest potential IPC: loads never stall. Potential for incorrect execution: need to be able to "undo" bad loads. Perhaps good to have a discussion here about when forwarding is unlikely vs. likely: ISA dependence? Program structures?

92 Detecting Ordering Violations
Case 1: Older store execs before younger load No problem; if same address stld forwarding happens Case 2: Older store execs after younger load Store scans all younger loads Address match  ordering violation

93 Loads Checking for Earlier Stores
On load dispatch, find data from an earlier store: compare the load's address against the address bank of all older stores, and among the matches use the youngest older store (ignoring any earlier matches beyond it). (Figure: a load to 0x4000 matches two older stores to 0x4000 and uses the closer one, ignoring a store to 0x4120.) The logic needs adjusting so the load need not be at the bottom of the LSQ and the LSQ can wrap around. As mentioned in the earlier notes, the real circuitry gets quite a bit messier when you have to deal with different memory widths and/or unaligned accesses. If |LSQ| is large, the logic can be adapted to have log delay.

94 Data Forwarding This is ugly, complicated, slow, and power hungry
On executing a store (STA+STD), check for later loads: similar logic to the previous slide, but scanning younger loads whose addresses match so they capture the store's value; loads that already obtained a now-stale value are marked overwritten. This logic must handle the situation where more than one store writes to the same address: the load should only pick up a value from the most recent matching store, which may be hard to tell if some store addresses have not yet been computed. This is ugly, complicated, slow, and power hungry.

95 Data-Capture Scheduler
Dispatch: read available operands from the ARF/ROB and store them in the scheduler; operands that are still missing are captured later from the bypass network. Issue: when ready, operands are sent directly from the scheduler to the functional units. (Figure: fetch & dispatch feed the data-capture scheduler; the PRF/ROB gets the physical register update, and results are bypassed back into the scheduler.) P-Pro family processors use data-capture-style schedulers. Dispatch is usually the same as the Allocate (or just "alloc") stage(s).

96 Scheduling Loop or Wakeup-Select Loop
Wake-up part: an executing instruction notifies its dependents; waiting instructions check whether all of their dependences are satisfied and, if so, "wake up". Select part: choose which instructions get to execute, since more than one instruction can be ready and the number of functional units and memory ports is limited.

97 Interaction with Execution
(Figure: the select logic drives the payload RAM; each entry holds D, SL, SR, the opcode, and ValL/ValR.) D = destination tag, SL = left source tag, SR = right source tag, ValL = left operand value, ValR = right operand value. The scheduler is typically broken up into the CAM-based scheduling part and a RAM-based "payload" part that holds the actual values (plus the instruction opcode and any other information required for execution) that get sent to the functional units/ALUs.

98 Simple Scheduler Pipeline
(Pipeline figure: A is selected, reads its payload, and executes; its tag broadcast enables capture on tag match, waking B, which captures, is selected, and executes, then wakes C.) Simple case with minimal pipelining: dependent instructions can execute in back-to-back cycles, but the achievable clock speed will be slow because each cycle contains too much work (select, payload read, execute, bypass, and capture). Very long clock cycle.

99 Deeper Scheduler Pipeline
(Pipeline figure: wakeup/capture and select/payload/execute are now split across separate cycles for A, B, and C.) Faster clock speed, but capture and payload read land in the same cycle.

100 Very Deep Scheduler Pipeline
(Pipeline figure: A and B are both ready, but only A is selected, so B bids again; the A→C and C→D results must be bypassed, while B→D is OK without a bypass.) Very aggressive pipelining, but now with a greater IPC penalty due to not being able to issue dependent instructions in back-to-back cycles. Good segue to the many research papers on aggressive and/or speculative pipelining of the scheduler (quite a few of these in ISCA/MICRO/HPCA in the early 2000s). Dependent instructions can't execute back-to-back.

101 Non-Data-Capture Scheduler
(Figure: two variants side by side; in both, fetch & dispatch feed the scheduler, but operand values live in the register file (ARF plus PRF, or a unified PRF) and are read after issue, with the physical registers updated by the functional units.)

102 Pipeline Timing
(Timing figure: data-capture does Select, Payload, Execute, with wakeup overlapping execute; non-data-capture inserts a "skip" cycle and a PRF read between select and execute.) The idea of a "skip" cycle is just to abstract away any work that has to be done between schedule and execute: payload RAM reading, picking up values from the bypass bus, reading values from the physical register file, or simple wire delay to get the data from the scheduler logic all the way over to the execution units. This example assumes a two-cycle PRF read latency: you read the register identifier (e.g., "P13") out of the payload RAM, and then that is used as an index into the PRF to read the actual data value. Substantial increase in schedule-to-execute latency.

103 Handling Multi-Cycle Instructions
(Figure: for a single-cycle Add R1 = R2 + R3, the dependent Xor R4 = R1 ^ R5 can be woken up so it executes the next cycle; for a multi-cycle Mul R1 = R2 × R3, the wakeup of the dependent Add R4 = R1 + R5 must be delayed until the result is nearly ready.) Instructions can't execute too early.

104 Non-Deterministic Latencies
Real situations have unknown latency. Load instructions: latency ∈ {L1_lat, L2_lat, L3_lat, DRAM_lat}, and DRAM_lat is not a constant either because of queuing delays. Architecture-specific cases: the PowerPC 603 has an "early out" for multiplication, and Intel's Core 2 has an early-out divider as well.

105 What to do on a cache miss?
Load-Hit Speculation. Caches work pretty well: hit rates are high (otherwise we wouldn't use caches), so assume all loads hit in the cache. (Figure: for R1 = 16[$sp], the dependent R2 = R1 + #4 is scheduled assuming a hit, with its wakeup broadcast delayed by the DL1 latency so the data can be forwarded on a hit.) What to do on a cache miss?

106 Simple Select Logic
Grant_0 = 1; Grant_1 = !Bid_0; Grant_2 = !Bid_0 & !Bid_1; Grant_3 = !Bid_0 & !Bid_1 & !Bid_2; …; Grant_{n-1} = !Bid_0 & … & !Bid_{n-2}. A naive chain over S scheduler entries yields O(S) gate delay; a tree implementation needs only O(log S) gate levels. This just selects the first ready instruction, where "first" is simply determined by physical location in the scheduler (the top entry has highest priority). A grant may be seen by an entry that is not ready, in which case the grant is simply ignored (the circuit ensures that only one ready entry will ever receive a grant).
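A minimal C sketch of the priority-select equations above; unlike the hardware, which broadcasts the raw grant and lets non-ready entries ignore it, the sketch masks the grant with the bid directly.

```c
#include <stdio.h>

#define S 10

int main(void) {
    int bid[S] = {0, 0, 1, 0, 1, 1, 0, 0, 1, 0};   /* which entries are ready */
    int grant[S];
    int none_before = 1;                 /* running !bid0 & ... & !bid(i-1) */
    for (int i = 0; i < S; i++) {
        grant[i] = bid[i] && none_before;
        none_before = none_before && !bid[i];
    }
    for (int i = 0; i < S; i++)
        if (grant[i]) printf("entry %d selected\n", i);   /* entry 2 */
    return 0;
}
```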

107 Implementing Oldest First Select
(Figure: a tree of age comparators over the scheduler entries; the oldest ready instruction wins the grant.) Each box in the select logic is a MIN operation and passes the lower timestamp (older instruction) onward; non-ready instructions effectively present a timestamp of ∞. At the root of the tree, the surviving timestamp is that of the oldest AND ready instruction. Must broadcast the granted age back to the instructions.

108 Problems in N-of-M Select
Age-aware 1-of-M selection has O(log M) gate delay per select; N of them in series gives O(N log M) delay. This is a serial selection: every ready instruction bids to the first select circuit; if it is not selected, it continues bidding on the next one (if it is selected, its timestamp is changed to ∞ to indicate that it no longer needs to bid). Each select has O(log M) gate delay assuming the tree implementation from the last lecture, but we have N such circuits in series.

109 Select Binding
Instructions are bound to a particular select port (e.g., the select logic for ALU1 or ALU2) before they are ready. This wastes resources: 3 instructions may be ready while only 1 gets to issue (the other port sits idle), and selection is not quite oldest-first (e.g., the ready instructions are aged 2, 3, and 4, but the issued ones are 2 and 4). Bob Colwell's chapter in Shen and Lipasti's book indicates that they used some sort of load-balancing approach for assigning select ports to instructions; this could probably be done by assigning an instruction to a valid port with the fewest unexecuted instructions already bound to it. This doesn't guarantee that the bad case can't happen, but it hopefully reduces its frequency.

110 Execution Ports Divide functional units into P groups
These groups are called "ports". Area is only O(P²·M·log M), where P << F, and the logic for tracking bids and grants is less complex (it deals with P sets). (Figure: ports grouping ALU1-3, shift, load, store, M/D, FAdd, FM/D, and SIMD units, with instructions such as ADD, LOAD, and MUL bound to particular ports.)

111 Decentralized RS Natural split: INT vs. FP L1 Data Cache Int Cluster
(Figure: an integer cluster with ALU1/ALU2, Load, and Store ports sharing the INT RF and the L1 data cache, and an FP cluster with FAdd and FM/D plus FP-Ld/FP-St sharing the FP RF; wakeup is INT-only or FP-only within each cluster, across ports 0-3.) This often implies a non-ROB-based physical register file: one "unified" integer PRF and one "unified" FP PRF, each managed separately with its own free list. This picture assumes no direct INT-to-FP move instructions (or the other way around).

112 Higher Complexity not Worth Effort
(Plot: performance vs. "effort" for scalar in-order, moderate-pipe superscalar/OOO, and very-deep-pipe aggressive superscalar/OOO.) It made sense to go superscalar/OOO: good ROI. Beyond that, there is very little gain for substantial effort.

113 SMP Machines
SMP = Symmetric Multi-Processing. Symmetric = all CPUs have "equal" access to memory. The OS sees multiple CPUs and runs one process (or thread) on each CPU. (Figure: CPU0-CPU3 sharing memory.)

114 MP Workload Benefits
(Figure: runtime of Task A and Task B on one 4-wide OOO CPU, one 3-wide OOO CPU, two 3-wide OOO CPUs, and two 2-wide OOO CPUs.) Just showing that with parallelism, even two "smaller" cores may provide better overall performance than one "regular" core, and the smaller cores are likely to be cheaper from an area and power standpoint due to how these tend to grow super-linearly with width.

115 … If Only One Task Available
(Figure: with only Task A, two cores give no benefit over one; a single 3-wide CPU matches the dual 3-wide configuration, and the dual 2-wide configuration is actually a performance degradation, with one core idle.) But you're stuck if you care about single-thread performance.

116 Chip-Multiprocessing (CMP)
Simple SMP on the same chip CPUs now called “cores” by hardware designers OS designers still call these “CPUs” Intel “Smithfield” Block Diagram AMD Dual-Core Athlon FX

117 On-chip Interconnects (1/4)
Today, (Core+L1+L2) = “core” (L3+I/O+Memory) = “uncore” How to interconnect multiple “core”s to “uncore”? Possible topologies Bus Crossbar Ring Mesh Torus Core $ LLC $ Memory Controller

118 On-chip Interconnects (2/4)
Possible topologies Bus Crossbar Ring Mesh Torus Core $ $ Bank 0 $ Bank 1 $ Bank 2 $ Bank 3 Memory Controller Oracle UltraSPARC T5 (3.6GHz, 16 cores, 8 threads per core)

119 On-chip Interconnects (3/4)
Possible topologies Bus Crossbar Ring Mesh Torus Core $ Memory Controller $ Bank 0 $ Bank 1 $ Bank 2 $ Bank 3 3 ports per switch Simple and cheap Can be bi-directional to reduce latency Intel Sandy Bridge (3.5GHz, 6 cores, 2 threads per core)

120 On-chip Interconnects (4/4)
Possible topologies Bus Crossbar Ring Mesh Torus Core $ Bank 1 Bank 0 Bank 4 Bank 3 Memory Controller Bank 2 Bank 7 Bank 6 Bank 5 Tilera Tile64 (866MHz, 64 cores) Up to 5 ports per switch Tiled organization combines core and cache

121 Multi-Threading
Uni-processor: 4-6 wide, lucky if you get 1-2 IPC; poor utilization of transistors. SMP: 2-4 CPUs, but needs independent threads; poor utilization as well (if tasks are limited). {Coarse-Grained, Fine-Grained, Simultaneous}-MT: use a single large uni-processor as a multi-processor; the core provides multiple hardware contexts (threads), with a per-thread PC and per-thread ARF (or map table). Each core appears as multiple CPUs; OS designers still call these "CPUs".

122 Dependencies limit functional unit utilization
Scalar Pipeline Time To motivate how we came to incorporate multithreading into EV8, let’s return to the early 1980’s and take a very high level view of instruction execution. This very abstract diagram illustrates the activity in just the execute stage of a single issue machine, with the red boxes showing instruction execution. Note gaps due to multiple cycle operation latency and inter-instruction dependencies which result in less than perfect utilization of the function units. Dependencies limit functional unit utilization

123 Higher performance than scalar, but lower utilization
Superscalar Pipeline (execution slots over time). If that weren't bad enough, for more performance we now issue in parallel (wide issue) to use more functional units in a cycle. Even with sophisticated techniques like out-of-order (dynamic) issue and sophisticated branch prediction, this leads to more waste, since there aren't always enough instructions to issue in a cycle. With a peak sustainable execution rate of 4, a wide variety of studies have shown functional unit utilization to be under 50%, and getting worse as machines continue to get wider. Note I'm not saying that it's bad to keep on this trajectory, because continuing to make the machine wider still provides an absolute single-stream performance benefit. Furthermore, the key point is not keeping FUs busy, because they really aren't that expensive, but that busy FUs represent more work getting done. So what can we do? Higher performance than scalar, but lower utilization.

124 Chip Multiprocessing (CMP)
(Figure: two smaller cores' execution slots over time.) Another approach is to get more efficiency by using two smaller CPUs, but this sacrifices single-stream performance. And Amdahl's law tells us that sometimes you have parallelism and sometimes you don't, so one can't always use multiple streams; you will still see reduced functional unit utilization. Limited utilization when running one thread.

125 Coarse-Grained Multithreading
(Figure: execution slots over time, with a hardware context switch between threads.) Only good for long-latency ops (i.e., cache misses).

126 Fine-Grained Multithreading
(Figure: each cycle issues from a different thread.) A saturated workload (lots of threads) fills the slots; an unsaturated workload still leaves lots of stalls. The classic answer to the problem of dependencies is to take instructions from multiple threads, but this still leaves a lot of wasted slots, and some variants of this style of multithreading result in designs where every thread runs much slower. I actually worked on this style of multithreading for my Ph.D. ages ago, but had abandoned the idea until now. Intra-thread dependencies still limit performance.

127 Simultaneous Multithreading
What changed my mind was the work by Dean Tullsen at U. Washington, who addressed the waste issue by proposing simultaneous multithreading: simply use any available slot for any available thread. But his work was somewhat incomplete, so we then collaborated to work on achieving the goal of uncompromised single-stream performance together with multithreading. Max utilization of functional units.

128 Paired vs. Separate Processor/Memory?
Separate CPU/memory: uniform memory access (UMA), equal latency to memory, low peak performance. Paired CPU/memory: non-uniform memory access (NUMA), faster local memory, data placement matters, high peak performance. (Figure: CPUs connected to shared memory through routers vs. CPUs each paired with a local memory.)

129 Issues for Shared Memory Systems
Two big ones Cache coherence Memory consistency model Closely related Often confused

130 Cache Coherence: The Problem
Variable A initially has the value 0. P1 stores the value 1 into A; P2 then loads A from memory and sees the old value 0, because P1's updated copy still sits in P1's cache. (Figure: P1 and P2 each with an L1 holding A, connected over a bus to main memory where A is still 0.) Need to do something to keep P2's cache coherent.

131 Simple MSI Protocol Usable coherence protocol Cache Actions:
A usable coherence protocol. Cache actions: Load, Store, Evict. Bus actions: BusRd, BusRdX, BusInv, BusWB, BusReply. (State diagram: in Invalid, a Load issues BusRd and moves to Shared, and a Store issues BusRdX and moves to Modified. In Shared, a Load hits locally, a Store issues BusInv and moves to Modified, an Evict silently invalidates, an observed BusRd may reply, and an observed BusRdX/BusInv invalidates, optionally replying. In Modified, Loads and Stores hit locally, an observed BusRd replies with data and downgrades to Shared, an observed BusRdX replies and invalidates, and an Evict writes the line back with BusWB.)
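A minimal C sketch of the MSI transitions described above, for a single cache line, with bus messages modeled as printfs; this is an illustration of the protocol on the slide, not any particular implementation.

```c
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } MsiState;

/* Processor-side events */
static MsiState on_load(MsiState s) {
    if (s == INVALID) { printf("BusRd\n"); return SHARED; }
    return s;                              /* S or M: hit, no bus action */
}
static MsiState on_store(MsiState s) {
    if (s == INVALID) { printf("BusRdX\n"); return MODIFIED; }
    if (s == SHARED)  { printf("BusInv\n"); return MODIFIED; }
    return MODIFIED;                       /* already M: hit */
}
static MsiState on_evict(MsiState s) {
    if (s == MODIFIED) printf("BusWB\n");  /* write dirty data back */
    return INVALID;
}

/* Snooped bus events from another core */
static MsiState on_bus_rd(MsiState s) {
    if (s != INVALID) printf("BusReply\n");
    return (s == MODIFIED) ? SHARED : s;   /* M must downgrade to S */
}
static MsiState on_bus_rdx(MsiState s) {
    if (s != INVALID) printf("BusReply\n");
    return INVALID;                        /* another writer: invalidate */
}

int main(void) {
    MsiState s = INVALID;
    s = on_load(s);      /* I -> S via BusRd            */
    s = on_store(s);     /* S -> M via BusInv           */
    s = on_bus_rd(s);    /* M -> S, supply data         */
    s = on_bus_rdx(s);   /* S -> I                      */
    s = on_evict(s);     /* already I, nothing to do    */
    (void)s;
    return 0;
}
```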

132 Coherence vs. Consistency
Coherence concerns only one memory location; consistency concerns ordering across all locations. A memory system is coherent if all operations to a location can be serialized, operations performed by any core appear in program order, and a read returns the value written by the last store to that location. A memory system is consistent if it follows the rules of its memory model: operations on memory locations appear in some defined order.

133 Sequential Consistency (SC)
Processors issue memory ops in program order. (Figure: P1, P2, P3 connect to memory through a switch that is randomly set after each memory op.) SC defines a single sequential order among all ops.

134 Mutex Example w/ Store Buffer
(Figure: P1 and P2 each with a store buffer on a shared bus; the Write A and Write B linger in the store buffers while the Read B and Read A both return 0 from memory.) The code:
P1: lockA: A = 1; if (B != 0) { A = 0; goto lockA; } /* critical section */ A = 0;
P2: lockB: B = 1; if (A != 0) { B = 0; goto lockB; } /* critical section */ B = 0;
With store buffers, both reads can see 0 and both processors enter the critical section: this mutex does not work.

135 Relaxed Consistency Models
Sequential Consistency (SC) enforces R → W, R → R, W → R, W → W. Total Store Ordering (TSO) relaxes W → R, keeping R → W, R → R, W → W. Partial Store Ordering also relaxes W → W (coalescing write buffers), keeping R → W, R → R. Weak Ordering or Release Consistency (RC): all ordering is explicitly declared; use fences to define boundaries, and acquire/release to force flushing of values. (X → Y means X must complete before Y.)

136 Good Luck!

