
1 CSE502: Computer Architecture Review

2 CSE502: Computer Architecture Course Overview (1/2) Caveat 1: I'm (kind of) new here. Caveat 2: This is a (somewhat) new course. Computer Architecture is … the science and art of selecting and interconnecting hardware and software components to create computers …

3 CSE502: Computer Architecture Course Overview (2/2) This course is hard, roughly like CSE 506 – In CSE 506, you learn what's inside an OS – In CSE 502, you learn what's inside a CPU. This is a project course – Learn why things are the way they are, first hand – We will build emulators of CPU components.

4 CSE502: Computer Architecture Policy and Projects Probably different from other classes – Much more open, but much more strict. Most people followed the policy; some did not – Resembles the real world. You're here because you want to learn and to be here. If you managed to get your partner(s) to do the work – You're probably good enough to do it at your job too » The good: You might make a good manager » The bad: You didn't learn much. Time mgmt. often more important than tech. skill – If you started early, you probably have an A already.

5 CSE502: Computer Architecture Amdahl's Law Speedup = time without enhancement / time with enhancement. An enhancement speeds up fraction f of a task by factor S: time_new = time_orig · ( (1-f) + f/S ), so S_overall = 1 / ( (1-f) + f/S ). (Diagram: time_orig split into (1-f) and f; in time_new the f portion shrinks to f/S.)
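
A minimal C sketch of Amdahl's Law (not from the slides; the 80%/4x example numbers are only an illustration):

    #include <stdio.h>

    /* S_overall = 1 / ((1-f) + f/S); f = enhanced fraction, S = its speedup */
    static double amdahl_speedup(double f, double S) {
        return 1.0 / ((1.0 - f) + f / S);
    }

    int main(void) {
        /* Speeding up 80% of a task by 4x yields only 2.5x overall. */
        printf("S_overall = %.2f\n", amdahl_speedup(0.80, 4.0));
        return 0;
    }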

6 CSE502: Computer Architecture The Iron Law of Processor Performance Execution time = Total Work In Program (instructions) × CPI (or 1/IPC) × 1/f (frequency). The first factor is set by algorithms, compilers, and ISA extensions; CPI by the microarchitecture; cycle time by the microarchitecture and process technology. Architects target CPI, but must understand the others.

7 CSE502: Computer Architecture Averaging Performance Numbers (2/2) Arithmetic: times – proportional to time – e.g., latency. Harmonic: rates – inversely proportional to time – e.g., throughput. Geometric: ratios – unit-less quantities – e.g., speedups.
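
A minimal C sketch of the three means (the sample speedups are hypothetical values, not from the slides):

    #include <math.h>
    #include <stdio.h>

    static double arith_mean(const double *x, int n) {   /* times (latency) */
        double s = 0; for (int i = 0; i < n; i++) s += x[i];
        return s / n;
    }
    static double harm_mean(const double *x, int n) {    /* rates (throughput) */
        double s = 0; for (int i = 0; i < n; i++) s += 1.0 / x[i];
        return n / s;
    }
    static double geo_mean(const double *x, int n) {     /* ratios (speedups) */
        double s = 0; for (int i = 0; i < n; i++) s += log(x[i]);
        return exp(s / n);
    }

    int main(void) {
        double speedups[] = {1.2, 2.0, 0.9};
        printf("geomean = %.3f\n", geo_mean(speedups, 3));
        return 0;
    }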

8 CSE502: Computer Architecture Power vs. Energy Power: instantaneous rate of energy transfer – Expressed in Watts – In Architecture, implies conversion of electricity to heat – Power(Comp1+Comp2)=Power(Comp1)+Power(Comp2) Energy: measure of using power for some time – Expressed in Joules – power * time (joules = watts * seconds) – Energy(OP1+OP2)=Energy(OP1)+Energy(OP2) What uses power in a chip?

9 CSE502: Computer Architecture ISA: A contract between HW and SW ISA: Instruction Set Architecture – A well-defined hardware/software interface The contract between software and hardware – Functional definition of operations supported by hardware – Precise description of how to invoke all features No guarantees regarding – How operations are implemented – Which operations are fast and which are slow (and when) – Which operations take more energy (and which take less)

10 CSE502: Computer Architecture Components of an ISA Programmer-visible state – Program counter, general-purpose registers, memory, control registers. Programmer-visible behaviors – What to do, when to do it. A binary encoding. Example register-transfer-level description of an instruction: if imem[pc] == add rd, rs, rt then pc ← pc+1; gpr[rd] = gpr[rs] + gpr[rt]. ISAs last forever; don't add stuff you don't need.
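
A minimal C sketch of that register-transfer-level description as one interpreter step; the Inst struct, memory sizes, and opcode value are assumptions for illustration:

    #include <stdint.h>

    enum { OP_ADD = 0 };                        /* assumed encoding */

    typedef struct { uint8_t op, rd, rs, rt; } Inst;

    static Inst     imem[1024];                 /* instruction memory */
    static uint64_t gpr[32];                    /* general-purpose registers */
    static uint64_t pc;

    void step(void) {
        Inst i = imem[pc];
        if (i.op == OP_ADD) {                   /* if imem[pc] == add rd, rs, rt */
            gpr[i.rd] = gpr[i.rs] + gpr[i.rt];  /* gpr[rd] = gpr[rs] + gpr[rt] */
            pc = pc + 1;                        /* pc <- pc + 1 */
        }
    }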

11 CSE502: Computer Architecture Locality Principle Recent past is a good indication of near future Spatial Locality: If you looked something up, it is very likely you will look up something nearby soon Temporal Locality: If you looked something up, it is very likely that you will look it up again soon

12 CSE502: Computer Architecture Caches An automatically managed hierarchy. Break memory into blocks (several bytes) and transfer data to/from cache in blocks – spatial locality. Keep recently accessed blocks – temporal locality. (Diagram: Core ↔ $ ↔ $ ↔ Memory.)

13 CSE502: Computer Architecture Fully-Associative Cache Keep blocks in cache frames – data – state (e.g., valid) – address tag. (Diagram: the address splits into tag[63:6] and block offset[5:0]; every frame's tag is compared against the address tag in parallel, the comparators produce hit?, and a multiplexor selects the hitting frame's data.) What happens when the cache runs out of space?

14 CSE502: Computer Architecture The 3 Cs of Cache Misses Compulsory: never accessed before. Capacity: accessed long ago and already replaced. Conflict: neither compulsory nor capacity. Coherence: a fourth C (in multi-cores, must become owner to write).

15 CSE502: Computer Architecture Cache Size Cache size is data capacity (don't count tag and state) – Bigger can exploit temporal locality better – Not always better. Too large a cache – Smaller is faster → bigger is slower – Access time may hurt critical path. Too small a cache – Limited temporal locality – Useful data constantly replaced. (Plot: hit rate rises with capacity and saturates once capacity covers the working set size.)

16 CSE502: Computer Architecture Block Size Block size is the data that is – Associated with an address tag – Not necessarily the unit of transfer between hierarchies. Too small a block – Don't exploit spatial locality well – Excessive tag overhead. Too large a block – Useless data transferred – Too few total blocks → useful data frequently replaced. (Plot: hit rate vs. block size rises, peaks, then falls.)

17 CSE502: Computer Architecture Direct-Mapped Cache Use middle bits as index; only one tag comparison. (Diagram: the address splits into tag[63:16], index[15:6], and block offset[5:0]; a decoder selects one frame by index, and a single comparator checks its tag for a match (hit?).)
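
A minimal C sketch of that address split, using the slide's field widths (64 B blocks from offset[5:0], 1024 frames from index[15:6]):

    #include <stdint.h>

    #define OFFSET_BITS 6    /* block offset[5:0] -> 64 B blocks */
    #define INDEX_BITS  10   /* index[15:6]       -> 1024 frames */

    static inline uint64_t block_offset(uint64_t a) { return a & ((1ull << OFFSET_BITS) - 1); }
    static inline uint64_t cache_index(uint64_t a)  { return (a >> OFFSET_BITS) & ((1ull << INDEX_BITS) - 1); }
    static inline uint64_t cache_tag(uint64_t a)    { return a >> (OFFSET_BITS + INDEX_BITS); }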

18 CSE502: Computer Architecture N-Way Set-Associative Cache (Diagram: the address splits into tag[63:15], index[14:6], and block offset[5:0]; the index selects one set via a decoder, each of the N ways compares its tag in parallel, and multiplexors pick the hitting way's data.) Note the additional bit(s) moved from index to tag.

19 CSE502: Computer Architecture Associativity Larger associativity – lower miss rate (fewer conflicts) – higher power consumption. Smaller associativity – lower cost – faster hit time. (Plot: hit rate vs. associativity flattens quickly; ~5 ways is about where an L1-D stops benefiting.)

20 CSE502: Computer Architecture Parallel vs Serial Caches Tag and Data usually separate (tag is smaller & faster) – State bits stored along with tags: valid bit, LRU bit(s), … Parallel access to Tag and Data reduces latency (good for L1). Serial access to Tag and Data reduces power (good for L2+): the tag match enables the data array, so only the hitting way is read.

21 CSE502: Computer Architecture Physically-Indexed Caches Core requests are VAs. Cache index is PA[15:6] – VA passes through TLB – D-TLB on critical path. Cache tag is PA[63:16]. If index size < page size – can use VA for index. (Diagram: the VA splits into virtual page[63:13] and page offset[12:0]; the D-TLB translates the page number while the untranslated offset bits supply most of the index, with any index bits above the page offset coming from the physical address.)

22 CSE502: Computer Architecture Virtually-Indexed Caches Core requests are VAs. Cache index is VA[15:6]. Cache tag is PA[63:16]. Why not tag with VA? – Cache flush on ctx switch. Virtual aliases – Ensure they don't exist – … or check all on miss. (Diagram: the virtual index comes straight from the VA, in parallel with the D-TLB lookup; one index bit overlaps the translated page number.)

23 CSE502: Computer Architecture Inclusion Core often accesses blocks not present on chip – Should a block be allocated in L3, L2, and L1? Called inclusive caches: waste of space, requires forced evict (e.g., force evict from L1 on evict from L2+) – Or only allocate blocks in L1: called non-inclusive caches (why not exclusive?), must write back clean lines. Some processors combine both – L3 is inclusive of L1 and L2 – L2 is non-inclusive of L1 (like a large victim cache).

24 CSE502: Computer Architecture Parity & ECC Cosmic radiation can strike at any time – Especially at high altitude – Or during solar flares. What can be done? – Parity: 1 bit to indicate if the sum is odd/even (detects single-bit errors) – Error Correcting Codes (ECC): 8-bit code per 64-bit word, generally SECDED (Single-Error-Correct, Double-Error-Detect). Detecting errors on clean cache lines is harmless – Pretend it's a cache miss.
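
A minimal C sketch of the 1-bit parity check described above (even parity over a 64-bit word):

    #include <stdint.h>

    /* Returns 1 if the number of set bits in w is odd. */
    static inline unsigned parity64(uint64_t w) {
        w ^= w >> 32; w ^= w >> 16; w ^= w >> 8;
        w ^= w >> 4;  w ^= w >> 2;  w ^= w >> 1;
        return (unsigned)(w & 1);
    }
    /* Store parity64(data) next to data; on read, a mismatch flags a single-bit error. */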

25 CSE502: Computer Architecture SRAM vs. DRAM SRAM = Static RAM – As long as power is present, data is retained. DRAM = Dynamic RAM – If you don't do anything, you lose the data. SRAM: 6T per bit – built with normal high-speed CMOS technology. DRAM: 1T per bit (+1 capacitor) – built with a special DRAM process optimized for density.

26 CSE502: Computer Architecture DRAM Chip Organization Low-level organization is very similar to SRAM. Cells are only single-ended – Reads are destructive: contents are erased by reading. Row buffer holds read data – Data in the row buffer is called a DRAM row, often called a page (not necessarily the same as an OS page) – A read gets an entire row into the buffer – Block reads are always performed out of the row buffer: reading a whole row but accessing one block, similar to reading a cache line but accessing one word.

27 CSE502: Computer Architecture DIMM DRAM Organization (Diagram: a dual-rank x8 (2Rx8) DIMM built from x8 DRAM chips, each containing multiple banks.) x8 means each DRAM outputs 8 bits, so 8 chips are needed for DDRx (64-bit). All banks within a rank share all address and control pins. All banks are independent, but a rank can only talk to one bank at a time. Why 9 chips per rank? 64 bits data, 8 bits ECC.

28 CSE502: Computer Architecture AMAT with MLP If … cache hit is 10 cycles (core to L1 and back) and memory access is 100 cycles (core to mem and back), then … at 50% miss ratio, avg. access: 0.5×10 + 0.5×100 = 55. Unless MLP is >1.0, then … at 50% mr, 1.5 MLP, avg. access: (0.5×10 + 0.5×100)/1.5 ≈ 37; at 50% mr, 4.0 MLP, avg. access: (0.5×10 + 0.5×100)/4.0 ≈ 14. In many cases, MLP dictates performance.
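
A minimal C sketch of the slide's arithmetic (dividing the whole average by MLP, exactly as the slide does):

    #include <stdio.h>

    static double amat(double hit, double mem, double miss_ratio, double mlp) {
        return ((1.0 - miss_ratio) * hit + miss_ratio * mem) / mlp;
    }

    int main(void) {
        printf("%.1f\n", amat(10, 100, 0.5, 1.0));  /* 55.0  */
        printf("%.1f\n", amat(10, 100, 0.5, 1.5));  /* ~36.7 */
        printf("%.1f\n", amat(10, 100, 0.5, 4.0));  /* ~13.8 */
        return 0;
    }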

29 CSE502: Computer Architecture Memory Controller (1/2) (Diagram: requests from the CPU enter read, write, and response queues in a buffer; a scheduler picks among them and drives commands and data over Channel 0 and Channel 1.)

30 CSE502: Computer Architecture Memory Controller (2/2) Memory controller connects CPU and DRAM Receives requests after cache misses in LLC – Possibly originating from multiple cores Complicated piece of hardware, handles: – DRAM Refresh – Row-Buffer Management Policies – Address Mapping Schemes – Request Scheduling

31 CSE502: Computer Architecture Address Mapping Schemes Example open-page mapping schemes: High parallelism: [row rank bank column channel offset]. Easy expandability: [channel rank row bank column offset]. Example close-page mapping schemes: High parallelism: [row column rank bank channel offset]. Easy expandability: [channel rank row column bank offset].

32 CSE502: Computer Architecture Memory Request Scheduling Write buffering – Writes can wait until reads are done. Queue DRAM commands – Usually into per-bank queues – Allows easy reordering of ops. meant for the same bank. Common policies: – First-Come-First-Served (FCFS) – First-Ready First-Come-First-Served (FR-FCFS).

33 CSE502: Computer Architecture Prefetching (1/2) Fetch block ahead of demand. Targets compulsory, capacity, (& coherence) misses – Not conflict: the prefetched block would conflict too. Big challenges: – Knowing what to fetch: fetching useless blocks wastes resources – Knowing when to fetch: too early clutters storage (or gets thrown out before use); fetching too late defeats the purpose of pre-fetching.

34 CSE502: Computer Architecture Prefetching (2/2) (Timelines: without prefetching, a load walks L1→L2→DRAM and the total load-to-use latency is long; with a timely prefetch the data is already on its way, giving a much improved load-to-use latency; a late prefetch improves latency only somewhat.) Prefetching must be accurate and timely.

35 CSE502: Computer Architecture Next-Line (or Adjacent-Line) Prefetching On request for line X, prefetch X+1 (or X^0x1) – Assumes spatial locality Often a good assumption – Should stop at physical (OS) page boundaries Can often be done efficiently – Adjacent-line is convenient when next-level block is bigger – Prefetch from DRAM can use bursts and row-buffer hits Works for I$ and D$ – Instructions execute sequentially – Large data structures often span multiple blocks Simple, but usually not timely

36 CSE502: Computer Architecture Next-N-Line Prefetching On request for line X, prefetch X+1, X+2, …, X+N – N is called prefetch depth or prefetch degree. Must carefully tune depth N; a larger N is … – More likely to be useful (correct and timely) – More aggressive → more likely to make a mistake (might evict something useful) – More expensive → need storage for prefetched lines (might delay a useful request on an interconnect or port). Still simple, but more timely than Next-Line.
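
A minimal C sketch of next-N-line prefetching; issue_prefetch() is a hypothetical hook into the memory hierarchy, and the page check follows the previous slide's advice to stop at physical page boundaries:

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 64ull
    #define PAGE_SIZE  4096ull

    static void issue_prefetch(uint64_t addr) {      /* stand-in hook */
        printf("prefetch 0x%llx\n", (unsigned long long)addr);
    }

    static void on_access(uint64_t addr, unsigned N) {
        for (unsigned i = 1; i <= N; i++) {
            uint64_t pf = addr + i * BLOCK_SIZE;
            if (pf / PAGE_SIZE != addr / PAGE_SIZE)  /* stop at page boundary */
                break;
            issue_prefetch(pf);
        }
    }

    int main(void) { on_access(0x40ec0, 4); return 0; }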

37 CSE502: Computer Architecture Stride Prefetching Access patterns often follow a stride – Accessing a column of elements in a matrix – Accessing elements in an array of structs. Detect stride S, prefetch depth N – Prefetch X+1·S, X+2·S, …, X+N·S.

38 CSE502: Computer Architecture Localized Stride Prefetchers Store PC, last address, last stride, and count in the RPT (Reference Prediction Table). On access, check the RPT – Same stride? count++ if yes, count-- (or count=0) if no – If count is high, prefetch (last address + stride×N). (Example: loads at PCa 0x409A34, PCb 0x409A38, and a store at PCc 0x409A40 each index an RPT entry by PC tag, holding last address, stride N, and count; once confident about the stride (count > C_min), prefetch A+4N.)
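
A minimal C sketch of an RPT update; the table size, confidence threshold, and hashing are assumptions, and issue_prefetch() is again a stand-in hook:

    #include <stdint.h>

    #define RPT_ENTRIES 256
    #define C_MIN       2

    typedef struct { uint64_t tag, last_addr; int64_t stride; int count; } RptEntry;
    static RptEntry rpt[RPT_ENTRIES];

    static void issue_prefetch(uint64_t addr) { (void)addr; /* hook into hierarchy */ }

    void rpt_access(uint64_t pc, uint64_t addr, unsigned depth) {
        RptEntry *e = &rpt[(pc >> 2) % RPT_ENTRIES];
        if (e->tag != pc) {                      /* new entry for this PC */
            *e = (RptEntry){ .tag = pc, .last_addr = addr, .stride = 0, .count = 0 };
            return;
        }
        int64_t stride = (int64_t)(addr - e->last_addr);
        e->count = (stride != 0 && stride == e->stride) ? e->count + 1 : 0;
        e->stride = stride;
        e->last_addr = addr;
        if (e->count >= C_MIN)                   /* confident: prefetch ahead */
            for (unsigned i = 1; i <= depth; i++)
                issue_prefetch(addr + (uint64_t)(e->stride * (int64_t)i));
    }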

39 CSE502: Computer Architecture Evaluating Prefetchers Compare against larger caches – Complex prefetcher vs. simple prefetcher with a larger cache. Primary metrics – Coverage: prefetched hits / base misses – Accuracy: prefetched hits / total prefetches – Timeliness: latency of prefetched blocks / hit latency. Secondary metrics – Pollution: misses / (prefetched hits + base misses) – Bandwidth: (total prefetches + misses) / base misses – Power, energy, area, …

40 CSE502: Computer Architecture Before there was pipelining… Single-cycle control: hardwired – Low CPI (1) – Long clock period (to accommodate the slowest instruction). Multi-cycle control: micro-programmed – Short clock period – High CPI. Can we have both low CPI and a short clock period? (Timelines: single-cycle runs insn0.(fetch,decode,exec) then insn1.(fetch,decode,exec); multi-cycle runs insn0.fetch, insn0.dec, insn0.exec, insn1.fetch, insn1.dec, insn1.exec on a faster clock.)

41 CSE502: Computer Architecture Pipelining Start with the multi-cycle design. When insn0 goes from stage 1 to stage 2, insn1 starts stage 1. Each instruction passes through all stages, but instructions enter and leave at a faster rate. (Timelines: in the pipelined version, insn0, insn1, and insn2 overlap, each one stage behind the last.) Can have as many insns. in flight as there are stages.

42 CSE502: Computer Architecture Instruction Dependencies Data dependence – Read-After-Write (RAW) (the only true dependence): read must wait until the earlier write finishes – Anti-dependence (WAR): write must wait until the earlier read finishes (avoid clobbering) – Output dependence (WAW): an earlier write can't overwrite a later write. Control dependence (a.k.a. procedural dependence) – Branch condition must execute before the branch target – Instructions after a branch cannot run before the branch.

43 CSE502: Computer Architecture Pipeline Terminology Pipeline Hazards – Potential violations of program dependencies – Must ensure program dependencies are not violated Hazard Resolution – Static method: performed at compile time in software – Dynamic method: performed at runtime using hardware – Two options: Stall (costs perf.) or Forward (costs hw.) Pipeline Interlock – Hardware mechanism for dynamic hazard resolution – Must detect and enforce dependencies at runtime

44 CSE502: Computer Architecture Simple 5-stage Pipeline (Datapath diagram: PC → Inst Cache → register file (regA/regB read ports, R0-R7) → ALU → Data Cache → writeback, with pipeline latches IF/ID, ID/EX, EX/Mem, Mem/WB carrying op, dest, valA/valB, PC+1, target, ALU result, and memory data between stages, plus MUXes for forwarding and branch resolution (eq?).)

45 CSE502: Computer Architecture Balancing Pipeline Stages Stage latencies: T_IF&ID = 8 units, T_OF = 9 units, T_EX = 5 units, T_OS = 9 units. Coarser-grained machine cycle: # stages = 4, T_cyc = 9 units, 4 machine cycles / instruction. Finer-grained machine cycle: the same work split into # stages = 11, T_cyc = 3 units, 11 machine cycles / instruction.

46 CSE502: Computer Architecture IPC vs. Frequency Losing 10-15% IPC is not bad if frequency can double: e.g., 2.0 IPC at 1 GHz (1000 ps cycle) = 2 BIPS vs. 1.7 IPC at 2 GHz (500 ps cycle) = 3.4 BIPS. But frequency doesn't double – Latch/pipeline overhead – Stage imbalance (e.g., a 900 ps stage doesn't split into two clean 450 ps halves).

47 CSE502: Computer Architecture Architectures for Instruction Parallelism Scalar pipeline (baseline) – Instruction/overlap parallelism = D – Operation latency = 1 – Peak IPC = 1.0. (Pipeline diagram: successive instructions vs. time in cycles, with D different instructions overlapped.)

48 CSE502: Computer Architecture Superscalar Machine Superscalar (pipelined) execution – Instruction parallelism = D × N – Operation latency = 1 – Peak IPC = N per cycle. (Pipeline diagram: N instructions enter per cycle, so D × N different instructions are overlapped.)

49 CSE502: Computer Architecture RISC ISA Format Fixed-length – MIPS: all insts are 32 bits / 4 bytes. Few formats – MIPS has 3: R (reg, reg, reg), I (reg, reg, imm), J (addr) – Alpha has 5: Operate, Op w/ Imm, Mem, Branch, FP. Regularity across formats (when possible/practical) – MIPS & Alpha: opcode in the same bit-position for all formats – MIPS: rs & rt fields in the same bit-position for R and I formats – Alpha: ra/fa field in the same bit-position for all 5 formats.

50 CSE502: Computer Architecture Superscalar Decode for RISC ISAs Decode X insns. per cycle (e.g., 4-wide) – Just duplicate the hardware – Instructions aligned at 32-bit boundaries. (Diagram: a scalar front-end feeds one 32-bit inst to one decoder; a 4-wide superscalar fetch feeds four 32-bit insts to four decoders, producing four decoded insts per cycle.)

51 CSE502: Computer Architecture CISC ISA RISC focus: fast access to information – Easy decode, I$, large RFs, D$. CISC focus: max expressiveness per min space – Designed in an era with fewer transistors and fewer chips – Each memory access very expensive: pack as much work into as few bytes as possible. More expressive instructions – Better potential code generation in theory – More complex code generation in practice.

52 CSE502: Computer Architecture ADD in RISC ISA Mode: Register. Example: ADD R4, R3, R2. Meaning: R4 = R3 + R2.

53 CSE502: Computer Architecture ADD in CISC ISA Register: ADD R4, R3 → R4 = R4 + R3. Immediate: ADD R4, #3 → R4 = R4 + 3. Displacement: ADD R4, 100(R1) → R4 = R4 + Mem[100+R1]. Register Indirect: ADD R4, (R1) → R4 = R4 + Mem[R1]. Indexed/Base: ADD R3, (R1+R2) → R3 = R3 + Mem[R1+R2]. Direct/Absolute: ADD R1, (1234) → R1 = R1 + Mem[1234]. Memory Indirect: ADD R1, @(R3) → R1 = R1 + Mem[Mem[R3]]. Auto-Increment: ADD R1, (R2)+ → R1 = R1 + Mem[R2]; R2++. Auto-Decrement: ADD R1, -(R2) → R2--; R1 = R1 + Mem[R2].

54 CSE502: Computer Architecture RISC (MIPS) vs CISC (x86) MIPS: lui R1, Disp[31:16]; ori R1, R1, Disp[15:0]; add R1, R1, R2; shli R3, R3, 3; add R3, R3, R1; lui R1, Imm[31:16]; ori R1, R1, Imm[15:0]; st [R3], R1. x86: MOV [EBX+EAX*8+Disp], Imm. 8 insns. at 32 bits each vs. 1 insn. at 88 bits: 2.9x!

55 CSE502: Computer Architecture x86 Encoding Basic x86 instruction: Prefixes (0-4 bytes), Opcode (1-2 bytes), Mod R/M (0-1 bytes), SIB (0-1 bytes), Displacement (0/1/2/4 bytes), Immediate (0/1/2/4 bytes). Shortest inst: 1 byte; longest inst: 15 bytes. Opcode has a flag indicating whether Mod R/M is present – Most instructions use the Mod R/M byte – Mod R/M specifies if the optional SIB byte is used – Mod R/M and SIB may specify additional constants. Instruction length not known until after decode.

56 CSE502: Computer Architecture Instruction Cache Organization To fetch N instructions per cycle... – The L1-I line must be wide enough for N instructions. The PC register selects the L1-I line. A fetch group is the set of insns. starting at PC – For an N-wide machine, [PC, PC+N-1]. (Diagram: the PC drives a decoder that selects one cache line of tag plus N instruction slots.)

57 CSE502: Computer Architecture Fetch Misalignment Now it takes two cycles to fetch N instructions. (Diagram: a misaligned fetch group straddles two cache lines; cycle 1 fetches the tail of one line, cycle 2 the head of the next.)

58 CSE502: Computer Architecture Fragmentation due to Branches Fetch group is aligned, cache line size > fetch group – Taken branches still limit fetch width. (Diagram: a taken branch mid-group invalidates (X) the slots after it.)

59 CSE502: Computer Architecture Types of Branches Direction: – Conditional vs. unconditional. Target: – PC-encoded: PC-relative or absolute offset – Computed: target derived from a register. Need both direction and target to find the next fetch group.

60 CSE502: Computer Architecture Branch Prediction Overview Use two hardware predictors – Direction predictor guesses if branch is taken or not-taken – Target predictor guesses the destination PC Predictions are based on history – Use previous behavior as indication of future behavior – Use historical context to disambiguate predictions

61 CSE502: Computer Architecture Direction vs. Target Prediction Direction: 0 or 1. Target: 32- or 64-bit value. Targets turn out to be generally easier to predict – Don't need to predict the not-taken target (it's PC + sizeof(inst)) – The taken target doesn't usually change. Only need to predict taken-branch targets. Prediction is really just a cache – the Branch Target Buffer (BTB).

62 CSE502: Computer Architecture Branch Target Buffer (BTB) (Diagram: the branch PC indexes a table whose entries hold a valid bit (V), the branch instruction address (BIA) as the tag, and the branch target address (BTA); on a tag match (hit?), the BTA becomes the next fetch PC.)

63 CSE502: Computer Architecture BTB w/ Partial Tags Fewer bits to compare, but prediction may alias. (Example: entries installed for branches at 0xcfff984c, 0xcfff9704, and 0xcfff9830 keep only partial tags, so a lookup for a different address such as 0xbeef9810 can falsely match.)

64 CSE502: Computer Architecture BTB w/ PC-offset Encoding Store only the low bits of the target and splice them onto the upper bits of the current PC (example: ff9704 stored instead of cfff9704). If the target is too far away or the PC rolls over, the prediction will be wrong.

65 CSE502: Computer Architecture Branches Have Locality If a branch was previously taken… – There's a good chance it'll be taken again. for(i=0; i < 100000; i++) { /* do stuff */ } This branch will be taken 99,999 times in a row.

66 CSE502: Computer Architecture Last Outcome Predictor Do what you did last time. 0xDC08: for(i=0; i < 100000; i++) { 0xDC44: if( (i % 100) == 0 ) tick(); 0xDC50: if( (i & 1) == 1 ) odd(); } (Predict taken (T) after a taken outcome, not-taken (N) after a not-taken one.)

67 CSE502: Computer Architecture Saturating Two-Bit Counter (FSM diagrams: last-outcome prediction flips between predict-T and predict-N-t on every outcome; the 2bC (2-bit counter) adds hysteresis with four states, the lower two predicting N-t and the upper two predicting T, moving up on a taken outcome and down on a not-taken one.)
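
A minimal C sketch of the 2-bit saturating counter: states 0..3, predict taken when the counter is 2 or 3, saturate at both ends:

    typedef unsigned char ctr2_t;                /* holds 0..3 */

    static inline int predict_taken(ctr2_t c) { return c >= 2; }

    static inline ctr2_t ctr2_update(ctr2_t c, int taken) {
        if (taken) return (ctr2_t)(c < 3 ? c + 1 : 3);   /* toward strong T   */
        else       return (ctr2_t)(c > 0 ? c - 1 : 0);   /* toward strong N-t */
    }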

68 CSE502: Computer Architecture Typical Organization of 2bC Predictor (Diagram: the 32- or 64-bit PC is hashed down to log2(n) bits to index a table of n counters; the selected counter supplies the prediction, and FSM update logic rewrites it with the actual outcome.)

69 CSE502: Computer Architecture Track the History of Branches Keep a previous-outcome bit per branch, plus one counter used when prev=0 and another used when prev=1. (Worked example in the slide: the history bit selects which counter makes the prediction and which counter gets trained, so the two cases are learned separately.)

70 CSE502: Computer Architecture Deeper History Covers More Patterns Counters learn the pattern of predictions. With 3 previous outcomes per branch, there is one counter per 3-bit history (prev=000, 001, 010, …), enough to learn a repeating pattern such as (0011)*.

71 CSE502: Computer Architecture Predictor Training Time Ex: prediction equals the opposite of the 2nd most recent outcome. Hist len = 2 → 4 states to train: NN→T, NT→T, TN→N, TT→N. Hist len = 3 → 8 states to train: NNN→T, NNT→T, NTN→N, NTT→N, TNN→T, TNT→T, TTN→N, TTT→N.

72 CSE502: Computer Architecture Predictor Organizations (Three options: hash the PC so each branch gets its own pattern table; hash the PC into a shared set of patterns; or a mix of both.)

73 CSE502: Computer Architecture Two-Level Predictor Organization Branch History Table (BHT) – 2^a entries – h-bit history per entry. Pattern History Table (PHT) – 2^b sets – 2^h counters per set. Total size in bits: h·2^a + 2·2^(b+h). (Each PHT entry is a 2-bit counter; a PC hash supplies the a and b index bits.)

74 CSE502: Computer Architecture Combined Indexing gshare (S. McFarling): XOR the k-bit PC hash with the k-bit global history to index the counter table, where k = log2(#counters).
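
A minimal C sketch of gshare indexing; the table size and the simple PC hash (dropping instruction-alignment bits) are assumptions:

    #include <stdint.h>

    #define K 12                              /* k = log2(#counters) */
    static unsigned char pht[1u << K];        /* 2-bit counters */
    static uint32_t ghr;                      /* k-bit global history */

    static inline uint32_t gshare_index(uint64_t pc) {
        return ((uint32_t)(pc >> 2) ^ ghr) & ((1u << K) - 1);
    }

    static inline void ghr_push(int taken) {  /* shift in the latest outcome */
        ghr = ((ghr << 1) | (taken & 1)) & ((1u << K) - 1);
    }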

75 CSE502: Computer Architecture OoO Execution Out-of-Order execution (OoO) – Totally in the hardware – Also called dynamic scheduling. Fetch many instructions into the instruction window – Use branch prediction to speculate past branches. Rename regs. to avoid false deps. (WAW and WAR). Execute insns. as soon as possible – As soon as deps. (regs and memory) are known. Today's machines: 100+ insn. scheduling windows.

76 CSE502: Computer Architecture Superscalar != Out-of-Order A: R1 = Load 16[R2] (cache miss); B: R3 = R1 + R4; C: R6 = Load 8[R9]; D: R5 = R2 – 4; E: R7 = Load 20[R5]; F: R4 = R4 – 1; G: BEQ R4, #0. (Timelines: in-order machines stall B-G behind A's miss, out-of-order machines run C-F under it; the slide's counts are 10 cycles for 1-wide in-order, 8 for 2-wide in-order, 7 for 1-wide out-of-order, and 5 for 2-wide out-of-order.)

77 CSE502: Computer Architecture Review of Register Dependencies Read-After-Write: A: R1 = R2 + R3; B: R4 = R1 * R4 – B must read R1 after A writes it. Write-After-Read: A: R1 = R3 / R4; B: R3 = R2 * R4 – B must write R3 after A reads it. Write-After-Write: A: R1 = R2 + R3; B: R1 = R3 * R4 – B's write of R1 must be the one that survives. (The slide's timelines show a correct and a violating ordering for each pair.)

78 CSE502: Computer Architecture Register Renaming Register renaming (in hardware) – Change register names to eliminate WAR/WAW hazards – Arch. registers (r1,f0…) are names, not storage locations – Can have more locations than names – Can have multiple active versions of same name How does it work? – Map-table: maps names to most recent locations – On a write: allocate new location, note in map-table – On a read: find location of most recent write via map-table
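
A minimal C sketch of the map-table mechanics just described; the table sizes and free-list discipline are assumptions, and freeing locations at commit is omitted:

    #define ARCH_REGS 32
    #define PHYS_REGS 128

    static int map_table[ARCH_REGS];          /* name -> most recent location */
    static int free_list[PHYS_REGS];
    static int free_top;

    static void rename_init(void) {
        for (int a = 0; a < ARCH_REGS; a++) map_table[a] = a;
        for (int p = ARCH_REGS; p < PHYS_REGS; p++) free_list[free_top++] = p;
    }

    /* Rename one instruction "rd = rs op rt". */
    static void rename(int rd, int rs, int rt, int *ps, int *pt, int *pd) {
        *ps = map_table[rs];                  /* reads: most recent write's location */
        *pt = map_table[rt];
        *pd = free_list[--free_top];          /* write: allocate a new location */
        map_table[rd] = *pd;                  /* note it in the map table */
    }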

79 CSE502: Computer Architecture Tomasulo's Algorithm Reservation Stations (RS): instruction buffer. Common Data Bus (CDB): broadcasts results to the RS. Register renaming: removes WAR/WAW hazards. Bypassing (not shown here, to keep the example simpler).

80 CSE502: Computer Architecture Tomasulo Data Structures (Diagram: fetched insns. dispatch into Reservation Stations whose entries hold op, tags T1/T2, values V1/V2, and a destination tag T; a Map Table maps register names to tags; the Regfile holds committed values R; the CDB broadcasts CDB.T and CDB.V, which tag-match (==) against waiting RS entries and the regfile.)

81 CSE502: Computer Architecture Where is the register rename? Value copies in the RS (V1, V2) – each insn. stores correct input values in its own RS entry. The free list is implicit (allocate/deallocate as part of the RS). (Same data-structure diagram as the previous slide.)

82 CSE502: Computer Architecture Precise State Speculative execution requires – (Ability to) abort & restart at every branch – Abort & restart at every load Synchronous (exception and trap) events require – Abort & restart at every load, store, divide, … Asynchronous (hardware) interrupts require – Abort & restart at every ?? Real world: bite the bullet – Implement abort & restart at every insn. – Called precise state

83 CSE502: Computer Architecture Complete and Retire Complete (C): insns. write results into the ROB – Out-of-order: don't block younger insns. Retire (R): a.k.a. commit, graduate – The ROB writes results to the register file – In-order: a stall back-propagates to younger insns. (Diagram: fetch (I$, BP) feeds a Re-Order Buffer (ROB) with complete and retire ports into the regfile and L1-D.)

84 CSE502: Computer Architecture P6 Data Structures (Diagram: like Tomasulo, with a Map Table and value-carrying RS, plus a circular ROB with head (retire) and tail (dispatch) pointers; each ROB entry holds its result value and destination register R, and the CDB broadcasts tag T and value to the RS and ROB.)

85 CSE502: Computer Architecture MIPS R10K: Alternative Implementation One big physical register file holds all data – no copies. + Register file close to FUs → small and fast data path. – ROB and RS on the side, used only for control and tags. (Diagram: the Map Table maps arch. registers to physical tags T; dispatch allocates a tag from a Free List and records the overwritten tag Told in the ROB; retire frees Told and updates the Architectural Map; the RS and CDB carry tags only.)

86 CSE502: Computer Architecture Executing Memory Instructions Load R3 = 0[R6]; Add R7 = R3 + R9; Store R4 → 0[R7]; Sub R1 = R1 – R2; Load R8 = 0[R1]. If R1 != R7 – Then Load R8 gets the correct value from the cache. If R1 == R7 – Then Load R8 should get its value from the Store – But it didn't! (Timeline: the first load misses, the store's address isn't known yet, and the later load issues and hits in the cache while the miss is still being serviced.)

87 CSE502: Computer Architecture Memory Disambiguation Problem The ordering problem is a data-dependence violation, and imprecise memory is worse than imprecise registers. Why can't this happen with non-memory insns.? – Operand specifiers in non-memory insns. are absolute: "R1" refers to one specific location – Operand specifiers in memory insns. are ambiguous: "R1" refers to a memory location specified by the value of R1; when pointers (e.g., R1) change, so does this location.

88 CSE502: Computer Architecture Two Problems Memory disambiguation on loads – Do earlier unexecuted stores to the same address exist? Binary question: the answer is yes or no. Store-to-load forwarding – I'm a load: which earlier store do I get my value from? – I'm a store: which later load(s) do I forward my value to? Non-binary question: the answer is one or more insn. identifiers.

89 CSE502: Computer Architecture Load/Store Queue (1/2) Load/store queue (LSQ) – Completed stores write to the LSQ – When a store retires, the head of the LSQ is written to L1-D (or a write buffer) – When loads execute, they access the LSQ and L1-D in parallel: forward from the LSQ if an older store has a matching address.
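
A minimal C sketch of a load probing older stores for forwarding; a flat array ordered oldest-to-youngest stands in for the real circular LSQ:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { bool valid, addr_known; uint64_t addr, data; } StqEntry;

    /* Scan older stores, youngest first; true if the load can forward. */
    bool stq_forward(const StqEntry *stq, int n_older,
                     uint64_t ld_addr, uint64_t *out) {
        for (int i = n_older - 1; i >= 0; i--) {
            if (stq[i].valid && stq[i].addr_known && stq[i].addr == ld_addr) {
                *out = stq[i].data;            /* store-to-load forwarding */
                return true;
            }
        }
        return false;                          /* fall through to L1-D */
    }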

90 CSE502: Computer Architecture Load/Store Queue (2/2) (Diagram: fetch (I$, BP) feeds a ROB with an LSQ alongside; stores deposit addr and store data into the LSQ, loads get load data from the LSQ or L1-D, and the regfile sits at retire. Almost a real processor diagram.)

91 CSE502: Computer Architecture Loads Execute When … Most aggressive approach: let loads execute as soon as their address is ready. Relies on the fact that store → load forwarding is rare. Greatest potential IPC – loads never stall. Potential for incorrect execution – Need to be able to undo bad loads.

92 CSE502: Computer Architecture Detecting Ordering Violations Case 1: older store executes before younger load – No problem; if same address, st → ld forwarding happens. Case 2: older store executes after younger load – The store scans all younger loads – An address match means an ordering violation.

93 CSE502: Computer Architecture Loads Checking for Earlier Stores On load dispatch, find data from earlier stores. (Diagram: the load's address, e.g. LD 0x4000, is compared (==) against every older store's address in the address bank; "addr match AND valid store AND no earlier match" selects which store's data bank entry to use, e.g. ST 0x4000 but not ST 0x4120.) Need to adjust this so that the load need not be at the bottom and the LSQ can wrap around. If |LSQ| is large, the logic can be adapted to have log delay.

94 CSE502: Computer Architecture Data Forwarding (similar logic to the previous slide) On executing a store (STA+STD), check for later loads. (Diagram: the store's address is compared against later loads' addresses; "addr match AND is load AND capture value not overwritten" forwards the store's data.) This is ugly, complicated, slow, and power hungry.

95 CSE502: Computer Architecture Data-Capture Scheduler Dispatch: read available operands from the ARF/ROB and store them in the scheduler; missing operands are filled in from the bypass network as producers complete. Issue: when ready, operands are sent directly from the scheduler to the functional units. (Diagram: fetch & dispatch → data-capture scheduler → functional units, with bypass back into the scheduler and physical register update into the PRF/ROB.)

96 CSE502: Computer Architecture Scheduling Loop or Wakeup-Select Loop Wake-up part: – An executing insn. notifies its dependents – Waiting insns. check if all deps. are satisfied: if yes, wake up the instruction. Select part: – Choose which instructions get to execute: more than one insn. can be ready, but the number of functional units and memory ports is limited.

97 CSE502: Computer Architecture Interaction with Execution (Diagram: select logic grants an entry; the payload RAM holds each entry's opcode and captured operand values ValL/ValR, which are read out on grant and sent, along with source/destination tags SL/SR and D, to functional unit A.)

98 CSE502: Computer Architecture Simple Scheduler Pipeline (Timeline: in cycle i, A goes through Select → Payload → Execute; its tag broadcast (wakeup) lets dependent B capture on a tag match, and the result broadcast enables B's capture so B issues in cycle i+1, then C follows.) Doing all of this in one cycle makes a very long clock cycle.

99 CSE502: Computer Architecture Deeper Scheduler Pipeline (Timeline: Select and Payload/Execute are split across cycles i, i+1, …; A's wakeup/tag broadcast overlaps B's select, so dependents still issue in consecutive cycles.) Faster, but Capture & Payload land on the same cycle.

100 CSE502: Computer Architecture Very Deep Scheduler Pipeline (Timeline for A, B, C, D across cycles i…i+6: A and B are both ready, only A is selected and B bids again; A→C and C→D must be bypassed, while B→D is OK without bypass.) Dependent instructions can't execute back-to-back.

101 CSE502: Computer Architecture Non-Data-Capture Scheduler (Diagrams: fetch & dispatch → scheduler → functional units, but operand values stay in the ARF/PRF, or in one unified PRF; the scheduler carries tags only, and physical register update goes straight to the register file.)

102 CSE502: Computer Architecture Pipeline Timing Data-capture: Select → Payload → Execute, with wakeup overlapping execute. Non-data-capture: Select → Payload → Read operands from PRF → Execute, which inserts a skip cycle: a substantial increase in schedule-to-execute latency.

103 CSE502: Computer Architecture Handling Multi-Cycle Instructions (Timelines: a 1-cycle Add wakes its dependent Xor so Sched → PayLd → Exec run back-to-back; a multi-cycle Mul must delay the wakeup of its dependent Add until its final execute cycle.) Instructions can't execute too early.

104 CSE502: Computer Architecture Non-Deterministic Latencies Real situations have unknown latency – Load instructions: latency ∈ {L1_lat, L2_lat, L3_lat, DRAM_lat}, and DRAM_lat is not a constant either (queuing delays) – Architecture-specific cases: the PowerPC 603 has an early-out for multiplication; Intel Core 2 has an early-out divider as well.

105 CSE502: Computer Architecture Load-Hit Speculation Caches work pretty well – Hit rates are high (otherwise we wouldn't use caches) – So assume all loads hit in the cache. (Timeline: R1 = 16[$sp] schedules its dependent R2 = R1 + #4 as if the load hits, with the wakeup broadcast delayed by the DL1 latency and data forwarded on a hit.) What to do on a cache miss?

106 CSE502: Computer Architecture Simple Select Logic Grant the first bidding scheduler entry: Grant0 = 1; Grant1 = !Bid0; Grant2 = !Bid0 & !Bid1; Grant3 = !Bid0 & !Bid1 & !Bid2; … Grant_n-1 = !Bid0 & … & !Bid_n-2. A linear scan over S entries yields O(S) gate delay; restructuring the priority chain as a tree needs only O(log S) gates of delay.

107 CSE502: Computer Architecture Implementing Oldest-First Select (Diagram: entries A-H bid to age-aware select logic, which grants the oldest ready entry.) Must broadcast the grant age to the instructions.

108 CSE502: Computer Architecture Problems in N-of-M Select (Diagram: N stacked layers of age-aware 1-of-M selects.) Each select has O(log M) gate delay, so N layers give O(N log M) delay.

109 CSE502: Computer Architecture Select Binding (Example: XOR, SUB, ADD, CMP are bound at dispatch to the select logic of ALU 1 or ALU 2.) Not-quite-oldest-first: ready insns. are aged 2, 3, 4, but the issued insns. are 2 and 4. Wasted resources: 3 instructions are ready, yet only 1 gets to issue while the other ALU's select sits idle.

110 CSE502: Computer Architecture Execution Ports Divide the functional units into P groups – called ports. Area is only O(P²·M·log M), where P << F (the number of functional units). Logic for tracking bids and grants is less complex (deals with P sets). (Example: ADD, LOAD, and MUL insns. dispatch to ports 0-4 feeding ALU 1-3, M/D, shift, FAdd, FM/D, SIMD, load, and store units.)

111 CSE502: Computer Architecture Decentralized RS Natural split: INT vs. FP. (Diagram: an Int cluster (ports 0-1: ALU 1, ALU 2, load, store; int-only wakeup; INT RF) and an FP cluster (ports 2-3: FAdd, FM/D, FP-Ld, FP-St; FP-only wakeup; FP RF) share the L1 data cache.) Often implies a non-ROB-based physical register file: one unified integer PRF and one unified FP PRF, each managed separately with its own free list.

112 CSE502: Computer Architecture Higher Complexity not Worth Effort (Plot: performance vs. effort rises steeply from scalar in-order to moderate-pipe superscalar/OOO, where going superscalar/OOO made sense (good ROI), then flattens: very-deep-pipe aggressive superscalar/OOO gains very little for substantial effort.)

113 CSE502: Computer Architecture SMP Machines SMP = Symmetric Multi-Processing – Symmetric = all CPUs have equal access to memory. The OS sees multiple CPUs – Runs one process (or thread) on each CPU.

114 CSE502: Computer Architecture MP Workload Benefits (Runtime diagram: tasks A and B run back-to-back on one 3-wide OOO CPU, a bit faster on a 4-wide OOO CPU, but finish sooner still when run in parallel on two 3-wide, or even two 2-wide, OOO CPUs.)

115 CSE502: Computer Architecture … If Only One Task Available (Runtime diagram: with only task A, the second core sits idle; two 3-wide CPUs give no benefit over 1 CPU, and two 2-wide CPUs are a performance degradation compared with one 4-wide.)

116 CSE502: Computer Architecture Chip-Multiprocessing (CMP) Simple SMP on the same chip – CPUs are now called "cores" by hardware designers – OS designers still call these CPUs. (Examples: Intel Smithfield block diagram, AMD dual-core Athlon FX.)

117 CSE502: Computer Architecture On-chip Interconnects (1/4) Today, (Core+L1+L2) = core and (L3+I/O+Memory) = uncore. How to interconnect multiple cores to the uncore? Possible topologies – Bus – Crossbar – Ring – Mesh – Torus. (Diagram: cores with private $ share a bus to the LLC $ and memory controller.)

118 CSE502: Computer Architecture On-chip Interconnects (2/4) Possible topologies – Bus – Crossbar – Ring – Mesh – Torus. (Diagram: cores with private $ connect through a crossbar to LLC banks 0-3 and the memory controller; example: Oracle UltraSPARC T5, 3.6 GHz, 16 cores, 8 threads per core.)

119 CSE502: Computer Architecture On-chip Interconnects (3/4) Possible topologies – Bus – Crossbar – Ring – Mesh – Torus. (Diagram: a ring connects the cores, LLC banks 0-3, and the memory controller; example: Intel Sandy Bridge, 3.5 GHz, 6 cores, 2 threads per core.) 3 ports per switch; simple and cheap; can be bi-directional to reduce latency.

120 CSE502: Computer Architecture On-chip Interconnects (4/4) Possible topologies – Bus – Crossbar – Ring – Mesh – Torus. (Diagram: a mesh of tiles, each combining a core, its $, and an LLC bank, with the memory controller at the edge; example: Tilera Tile64, 866 MHz, 64 cores.) Up to 5 ports per switch; the tiled organization combines core and cache.

121 CSE502: Computer Architecture Multi-Threading Uni-processor: 4-6 wide, lucky if you get 1-2 IPC – Poor utilization of transistors. SMP: 2-4 CPUs, but need independent threads – Poor utilization as well (if limited tasks). {Coarse-Grained, Fine-Grained, Simultaneous}-MT – Use a single large uni-processor as a multi-processor. The core provides multiple hardware contexts (threads) – Per-thread PC – Per-thread ARF (or map table) – Each core appears as multiple CPUs. OS designers still call these CPUs.

122 CSE502: Computer Architecture Scalar Pipeline (utilization diagram over time) Dependencies limit functional unit utilization.

123 CSE502: Computer Architecture Superscalar Pipeline (utilization diagram over time) Higher performance than scalar, but lower utilization.

124 CSE502: Computer Architecture Chip Multiprocessing (CMP) (utilization diagram over time) Limited utilization when running one thread.

125 CSE502: Computer Architecture Coarse-Grained Multithreading (utilization diagram over time, with hardware context switches) Only good for long-latency ops (i.e., cache misses).

126 CSE502: Computer Architecture Fine-Grained Multithreading (utilization diagram over time) Saturated workload → lots of threads. Unsaturated workload → lots of stalls. Intra-thread dependencies still limit performance.

127 CSE502: Computer Architecture Simultaneous Multithreading (utilization diagram over time) Max utilization of functional units.

128 CSE502: Computer Architecture Paired vs. Separate Processor/Memory? Separate CPU/memory – Uniform memory access (UMA): equal latency to memory – Low peak performance. Paired CPU/memory – Non-uniform memory access (NUMA): faster local memory, data placement matters – High peak performance. (Diagrams: UMA places all CPU($)s across routers from all memories; NUMA pairs each CPU($) with its own memory.)

129 CSE502: Computer Architecture Issues for Shared Memory Systems Two big ones – Cache coherence – Memory consistency model Closely related Often confused

130 CSE502: Computer Architecture Cache Coherence: The Problem Variable A initially has value 0. P1 stores value 1 into A; P2 then loads A from memory and sees the old value 0. (Diagram: P1 and P2 each have an L1 on a shared bus to main memory; at t1, P1's Store A=1 updates its own L1 copy from 0 to 1 while memory still holds A: 0; at t2, P2's Load A reads the stale 0.) Need to do something to keep P2's cache coherent.

131 CSE502: Computer Architecture Simple MSI Protocol Cache actions: Load, Store, Evict. Bus actions: BusRd, BusRdX, BusInv, BusWB, BusReply. Invalid → Shared on Load / BusRd; Invalid → Modified on Store / BusRdX. Shared: Load / --; Evict / --; BusRd / [BusReply]; Store / BusInv → Modified; BusRdX, BusInv / [BusReply] → Invalid. Modified: Load, Store / --; BusRd / BusReply → Shared; BusRdX / BusReply → Invalid; Evict / BusWB → Invalid. A usable coherence protocol.
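
A minimal C sketch of the core-side half of that MSI FSM (snoop-side transitions omitted); the enum and function names are mine, not the slide's:

    typedef enum { INVALID, SHARED, MODIFIED } MsiState;
    typedef enum { LOAD, STORE, EVICT } CacheAction;
    typedef enum { NONE, BUS_RD, BUS_RDX, BUS_INV, BUS_WB } BusAction;

    BusAction msi_core(MsiState *s, CacheAction a) {
        switch (*s) {
        case INVALID:
            if (a == LOAD)  { *s = SHARED;   return BUS_RD;  }
            if (a == STORE) { *s = MODIFIED; return BUS_RDX; }
            return NONE;
        case SHARED:
            if (a == STORE) { *s = MODIFIED; return BUS_INV; }  /* upgrade */
            if (a == EVICT) { *s = INVALID;  return NONE;    }  /* clean evict */
            return NONE;                                        /* Load / -- */
        case MODIFIED:
            if (a == EVICT) { *s = INVALID;  return BUS_WB;  }  /* write back */
            return NONE;                                        /* Load, Store / -- */
        }
        return NONE;
    }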

132 CSE502: Computer Architecture Coherence vs. Consistency Coherence concerns only one memory location Consistency concerns ordering for all locations A Memory System is Coherent if – Can serialize all operations to that location Operations performed by any core appear in program order – Read returns value written by last store to that location A Memory System is Consistent if – It follows the rules of its Memory Model Operations on memory locations appear in some defined order

133 CSE502: Computer Architecture Sequential Consistency (SC) (Diagram: processors P1, P2, P3 issue memory ops in program order; a switch, randomly set after each memory op, connects exactly one processor to memory at a time.) Defines a single sequential order among all ops.

134 CSE502: Computer Architecture Mutex Example w/ Store Buffer P1: lockA: A = 1; if (B != 0) { A = 0; goto lockA; } /* critical section */ A = 0; P2: lockB: B = 1; if (A != 0) { B = 0; goto lockB; } /* critical section */ B = 0. (Timeline: with store buffers, P1's Write A and P2's Write B are still buffered when P1 reads B at t1 and P2 reads A at t2, so both see 0 and both enter the critical section.) Does not work.

135 CSE502: Computer Architecture Relaxed Consistency Models (Notation: X → Y means X must complete before Y.) Sequential Consistency (SC): R→W, R→R, W→R, W→W. Total Store Ordering (TSO) relaxes W→R: keeps R→W, R→R, W→W. Partial Store Ordering also relaxes W→W (coalescing write buffers): keeps R→W, R→R. Weak Ordering or Release Consistency (RC) – All ordering explicitly declared: use fences to define boundaries; use acquire and release to force flushing of values.
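
A minimal C11 sketch (not from the slides) of how a fence repairs the mutex example above on TSO or weaker models, by ordering the flag write before the flag read:

    #include <stdatomic.h>

    static atomic_int A, B;

    int p1_try_lock(void) {
        atomic_store_explicit(&A, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* order W(A) before R(B) */
        if (atomic_load_explicit(&B, memory_order_relaxed) != 0) {
            atomic_store_explicit(&A, 0, memory_order_relaxed);
            return 0;                                /* retry, as in "goto lockA" */
        }
        return 1;                                    /* enter critical section */
    }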

136 CSE502: Computer Architecture Good Luck!

