Pentium Pro Case Study Prof. Mikko H. Lipasti

1 Pentium Pro Case Study Prof. Mikko H. Lipasti
University of Wisconsin-Madison. Lecture notes based on notes by John P. Shen; updated by Mikko Lipasti.

2 Pentium Pro Case Study
Microarchitecture: order-3 superscalar, out-of-order execution, speculative execution, in-order completion
Design Methodology
Performance Analysis
Retrospective

3 Goals of P6 Microarchitecture
IA-32 compliant
Performance (frequency × IPC)
Validation
Die size
Schedule
Power

4 P6 – The Big Picture
BTB – branch target buffer
ICU – instruction cache unit
BAC – branch address calculation (PC-relative branch targets)
ROB – reorder buffer
RRF – real register file (architected register file)
AGU – address generation unit
DCU – data cache unit
MOB – memory order buffer; maintains RAW ordering for memory dependences (loads/stores)
IEU – integer execution unit
JEU – jump/branch execution unit

5 Memory Hierarchy
BIU – bus interface unit
Separate L2 bus and memory bus ("backdoor" L2 cache) provide more bandwidth from the CPU to "all of memory"; L2 fills and writebacks have to travel through the core's BIU.
Level 1 instruction and data caches: 2-cycle access time
Level 2 unified cache: 6-cycle access time
Separate level 2 cache and memory address/data buses

6 Instruction Fetch
[Figure: fetch pipeline – instruction TLB, 8KB ICache with victim cache and stream buffer, instruction length decoder/marks, 512-entry branch target buffer, next-fetch logic, 256KB L2 cache]
Fetch every other cycle (not every cycle), since the BTB has a latency of two cycles. The BTB is described in a later slide.
On a BTB miss, early decode finds branches, checks the PC-relative offset, then implements backward-taken, forward-not-taken (BTFN) static prediction. The BTB is not very big (512 entries), so this second-level static prediction is very important.
The small I-cache is augmented with a stream buffer (sequential prefetching from L2) and a victim cache (cheap associativity, needed for self-modifying code). Instructions are supplied from either the victim cache, the I-cache, or the stream buffer.
An early instruction length decoder tries to find instruction boundaries.

7 Instruction Cache Unit
TLB lookup is done in parallel with all caches; the physical address from the TLB is compared against the physical tags in the I-cache, victim cache, and stream buffer.
Segment checks are delayed until later in the pipeline (x86 forms linear address = segment + offset). The front end operates strictly on linear addresses; if an access overflows a segment, later pipe stages catch this and treat it as an exception.

8 Branch Target Buffer
[Figure: 4-way set-associative BTB (128 sets) indexed by fetch address; each entry holds a tag, a 4-bit branch history register (BHR), and speculative state; shared pattern history table (PHT); return stack; prediction control logic produces the target address]
This is a PAs scheme in Yeh & Patt's classification (per-address BHR, shared PHT).
Trivia: Intel architects dispute that Yeh & Patt invented 2-level branch prediction; Yeh did an internship at Intel and subsequently published the Y&P 2-level paper. Ruffled feathers were smoothed since Yeh later took a permanent job at Intel.
The pattern history table (PHT) is not speculatively updated. A speculative branch history register (BHR) and prediction state are maintained; the predictor uses the speculative state if it exists for that branch.

9 Branch Prediction Algorithm
A two-level scheme with a BHR requires the BHR to be updated before the next prediction is made (otherwise the BHR is "stale" and can predict poorly). The BHR can't be updated right away, because the outcome is not known until much later. Solution: update with the speculative (predicted) outcome.
What if the prediction was wrong? The BHR is now corrupted. Solution: keep a nonspeculative BHR that doesn't get updated until the branch resolves; on a misprediction, the speculative BHR reverts to this.
Unclear how important the speculative BHR repair really is. Speculative updates are important (a stale BHR leads to lousy prediction rates), but does it really need to be repaired? Open question.
The current prediction updates the speculative history prior to the next instance of the branch instruction. The (nonspeculative) branch history register is updated during branch execution. Branch recovery flushes the front end and drains the execution core; a branch misprediction resets the speculative branch history state to match the BHR.
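The speculative-update-with-repair idea can be sketched in a few lines. This is a toy model, not Intel's implementation: the 4-bit history width and the PAs structure (per-address BHR) follow the slides, but indexing a single shared PHT directly by history, the 2-bit counters, and all names are illustrative assumptions.

```python
# Toy two-level predictor with a speculative BHR and repair-on-mispredict.
# Sizes and policies are illustrative, not the actual P6 design.
class TwoLevelPredictor:
    def __init__(self, hist_bits=4):
        self.mask = (1 << hist_bits) - 1
        self.bhr = {}        # per-branch nonspeculative history
        self.spec_bhr = {}   # per-branch speculative history
        self.pht = [2] * (1 << hist_bits)  # 2-bit counters, weakly taken

    def predict(self, pc):
        # Use speculative history if it exists for this branch
        hist = self.spec_bhr.get(pc, self.bhr.get(pc, 0))
        taken = self.pht[hist] >= 2
        # Shift the *predicted* outcome into the speculative BHR so the
        # next instance of this branch sees fresh (not stale) history
        self.spec_bhr[pc] = ((hist << 1) | taken) & self.mask
        return taken

    def resolve(self, pc, taken, mispredicted):
        hist = self.bhr.get(pc, 0)
        # PHT is updated nonspeculatively, at branch resolution
        ctr = self.pht[hist]
        self.pht[hist] = min(ctr + 1, 3) if taken else max(ctr - 1, 0)
        self.bhr[pc] = ((hist << 1) | taken) & self.mask
        if mispredicted:
            # Repair: speculative history reverts to the architectural BHR
            self.spec_bhr[pc] = self.bhr[pc]
```

A repeatedly taken branch quickly trains the counters through the speculative history, even before earlier instances resolve; dropping the `mispredicted` branch of `resolve` gives the "is repair really needed?" variant the notes leave open.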

10 Instruction Decode - 1
We will see later that lots of branches are resolved with the BAC and the BTFN static predictor.
Branch instruction detection
Branch address calculation: static prediction and branch-always execution
One branch decode per cycle ("break on branch")

11 Instruction Decode - 2
Covered this in an earlier lecture; coverage here can be pretty brief.
Decoder 0: fully general; if an x86 op requires more than 4 uops, it sequences uops from the uROM.
Decoders 1 & 2: can only handle very simple x86 ops that translate one-to-one to uops (e.g. register-register arithmetic ops).
The instruction buffer contains up to 16 instructions, which must be decoded and queued before the instruction buffer is re-filled.
Macro-instructions must shift from decoder 2 to decoder 1 to decoder 0.

12 What is a uop? Small two-operand instruction - very RISC-like.
IA-32 instruction: add (eax),(ebx)    MEM(eax) <- MEM(eax) + MEM(ebx)
Uop decomposition:
ld  guop0, (eax)     guop0 <- MEM(eax)
ld  guop1, (ebx)     guop1 <- MEM(ebx)
add guop0, guop1     guop0 <- guop0 + guop1
sta eax              (store-address uop)
std guop0            MEM(eax) <- guop0
sta & std together are equivalent to st guop0,(eax) in a RISC ISA. The PPro splits address generation and data movement into separate uops. Why? Because two very different things happen with stores in an OOO processor: address generation is needed to resolve memory dependences with stores, and needs to happen right away to avoid delaying loads unnecessarily. The actual data movement (std) cannot occur until commit, so it is less critical. It makes sense to split the two events into two uops.
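The decomposition above can be written as a tiny table-builder, which makes the sta/std split explicit. This is purely illustrative; the uop tuple format and the function are inventions for this sketch, with only the mnemonics and temporary names (guop0, guop1) taken from the slide.

```python
# Illustrative decomposition of a memory-to-memory add into P6-style uops.
# Not Intel's decoder; tuple format is made up for this sketch.
def decode_mem_mem_add(dst_reg, src_reg):
    return [
        ("ld",  "guop0", f"({dst_reg})"),  # guop0 <- MEM(dst)
        ("ld",  "guop1", f"({src_reg})"),  # guop1 <- MEM(src)
        ("add", "guop0", "guop1"),         # guop0 <- guop0 + guop1
        ("sta", dst_reg, None),            # store-address uop: resolves
                                           # memory dependences early
        ("std", "guop0", None),            # store-data uop: data moves
                                           # at commit time
    ]

uops = decode_mem_mem_add("eax", "ebx")
```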

13 Instruction Dispatch
[Figure: renaming/allocator, 6-entry uop queue, 3-entry dispatch buffer, mux logic to reservation station, retirement info; 2 cycles]
Register renaming.
Allocation requirements are "3-or-none": reorder buffer entries, a reservation station entry, and a load buffer or store buffer entry.
The dispatch buffer "probably" dispatches all 3 uops before re-fill.

14 Register Renaming - 1
EAX… are integer registers; FST… are FP registers; GuoP0… are temporary registers that uops use (e.g. registers for x86 ops that have memory operands).
Similar to Tomasulo's algorithm: uses ROB entry numbers as tags.
The register alias table (RAT) maintains a pointer to the most recent data for each renamed register.
Execution results are stored in the ROB.

15 Register Renaming - Example
Initially RAT/EAX points to ROB6 (since the sub has not yet completed); draw an arrow from RAT/EAX to ROB6. Initially RAT/EBX, RAT/ECX, RAT/FST0, and RAT/FST1 point to the RRF entries for each.
Now show the sub committing: the ROB6 result is copied to the RRF, and RAT/EAX is changed to point to RRF/EAX.
Now show the first add getting dispatched: it picks up the ebx and eax mappings from the RAT, then updates RAT/EAX to point to ROB7. The second add gets dispatched: it picks up the eax and ecx source mappings and changes RAT/EAX to point to ROB8.
Discussion: all three things happen in the same cycle (complete the sub, dispatch the two adds). Since there are RAW/WAR/WAW dependences occurring in the RAT, hazards exist and must be resolved. This is done with a complex tag-matching bypass network that checks for all concurrent reads/writes to/from the RAT.
Finally, fxch (floating-point exchange) is dispatched: the trick here is to just flip the two pointers: RAT/FST0 now points to FST1 and RAT/FST1 now points to FST0. The semantics of this instruction can actually be completed in the rename stage. © Shen, Lipasti
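The walkthrough above can be replayed with a minimal RAT model. This is a sketch under stated assumptions: tags are ROB entry numbers as the slides say, but the `("ROB", n)` / `("RRF", name)` mapping representation and all method names are illustrative, and the hazards-in-one-cycle bypass network is deliberately ignored (operations run sequentially here).

```python
# Minimal RAT/ROB renaming sketch replaying the slide's example.
class Renamer:
    def __init__(self, regs):
        # Each architected register initially maps to the RRF
        self.rat = {r: ("RRF", r) for r in regs}

    def dispatch(self, dest, srcs, rob_entry):
        # Read source mappings first, then redirect the destination
        # to the newly allocated ROB entry
        mappings = [self.rat[s] for s in srcs]
        self.rat[dest] = ("ROB", rob_entry)
        return mappings

    def commit(self, dest, rob_entry):
        # Result copied ROB -> RRF; retarget the RAT only if it still
        # points at this ROB entry (a younger write may have renamed it)
        if self.rat[dest] == ("ROB", rob_entry):
            self.rat[dest] = ("RRF", dest)

    def fxch(self, r0, r1):
        # fxch completes at rename: just swap the two RAT pointers
        self.rat[r0], self.rat[r1] = self.rat[r1], self.rat[r0]

r = Renamer(["eax", "ebx", "ecx", "fst0", "fst1"])
r.rat["eax"] = ("ROB", 6)                     # sub (in flight) wrote eax
r.commit("eax", 6)                            # sub commits: eax -> RRF
srcs1 = r.dispatch("eax", ["ebx", "eax"], 7)  # first add
srcs2 = r.dispatch("eax", ["eax", "ecx"], 8)  # second add reads ROB7
r.fxch("fst0", "fst1")                        # done entirely at rename
```

After the run, the second add's eax source maps to ROB7 (the first add's result), eax itself maps to ROB8, and the two FST pointers are swapped with no execution resources consumed.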

16 Challenges to Register Renaming
8086 legacy code: byte-addressable registers lead to the "partial register write" problem. Cannot easily rename, unless all renaming occurred at the byte level (i.e. register A gets half its data from the first move, the rest from the second move).
Trivia: Microsoft "promised" this type of code would be gone from Windows 95, so Intel engineers didn't deal with it. Instead, the PPro would stall dispatch on a partial register write and wait for conflicting writes to commit to the RRF, resulting in very poor performance for 16-bit code. Of course, Microsoft lied: Windows 95 was chock full of 16-bit code, causing a real problem for Intel. The older Pentium design was actually faster, in many cases, than the Pentium Pro, due to the partial register write stalls. The next version (Pentium II) fixed this problem (how?)

17 Out-of-Order Execution Engine
Loads are not issued until the addresses of all prior stores are known; otherwise a RAW violation could occur. The MOB checks for RAW on addresses, and then satisfies the load either from the DCU (if no RAW) or from a prior store (if a RAW exists).
The RS is not truly centralized (read the book). "Binding" means instructions that can execute on multiple units (e.g. integer ALU ops on IEU1 or IEU2, or address generation ops on AGU0 or AGU1) are bound at dispatch time to a particular FU (0 or 1). Dispatch-time binding is set by a load-balancing algorithm; however, this can lead to non-optimal utilization of the pseudo-centralized RS.
In-order branch issue and execution
In-order load/store issue to address generation units
Instruction execution and result bus scheduling
Is the reservation station "truly" centralized, and what is "binding"?

18 Reservation Station
Up to 5 instructions can issue in a cycle. However, there are only 3 broadcast busses for wakeup (writeback), so these have to be separately scheduled. Since FU latencies are not uniform, this is not trivial.
Cycle 1: order checking, operand availability
Cycle 2: writeback bus scheduling
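The writeback-bus constraint can be shown with a toy one-cycle scheduler: grant ready uops only if their (non-uniform) completion cycle still has a free bus. The greedy policy, latencies, and names here are illustrative assumptions; the actual P6 scheduling logic is not described in the slides, and a real scheduler would also track busses claimed by uops issued in earlier cycles.

```python
# Toy issue scheduler: up to 5 grants per cycle, but each grant must
# reserve one of 3 writeback busses in its completion cycle.
def schedule(ready_uops, now, bus_count=3):
    # ready_uops: list of (name, latency); returns names granted issue
    busses = {}   # completion cycle -> busses already claimed
    granted = []
    for name, latency in ready_uops:
        done = now + latency
        # Non-uniform latencies: uops with different latencies can
        # collide on the same writeback cycle; check before granting
        if busses.get(done, 0) < bus_count:
            busses[done] = busses.get(done, 0) + 1
            granted.append(name)
        if len(granted) == 5:   # at most 5 issues per cycle
            break
    return granted
```

With four single-cycle ALU ops ready at once, only three get the writeback bus; the fourth must wait even though an execution unit may be free, which is exactly why the second scheduling cycle exists.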

19 Memory Ordering Buffer (MOB)
"Coherency checking": in multiprocessor systems, loads have to be tracked against writes at remote processors (later, especially in 757).
Store coloring: at dispatch, loads are marked with the "color" of the previous store in program order. A load cannot issue to the DCU until the store of that color has done AGEN and reached the SAB. When a load does issue from the load buffer, conflict logic checks for RAW against the SAB; the load bypasses from the SDB on a RAW.
What if the SDB entry is not available yet (remember, std is a separate uop and may be stalled)? Stall the load. Store byte in SDB, but load word in LB? Probably stall until the write commits. Store partially overlaps with the load? Probably stall until the write commits.
Some ISAs avoid these problems: no subword loads or stores (all natural size, e.g. 8 bytes in the Alpha ISA), and all loads/stores must be naturally aligned to avoid partial overlaps. This means more headaches for the programmer/compiler, e.g.:
void myfunction (sometype *f) { x = *f; /* how does the compiler know that f is naturally aligned??? */ }
Load buffer retains loads until completed, for coherency checking
Store forwarding out of store buffers
2-cycle latency through the MOB
"Store coloring": load instructions are tagged by the last store
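The store-coloring check and the RAW/forwarding cases above can be condensed into two helper functions. This is a sketch under assumptions: the SAB/SDB representations and function names are invented, addresses are matched on full equality only (so the subword and partial-overlap cases from the notes, which would stall, are not modeled), and `None` stands in for "stall".

```python
# Sketch of MOB store coloring and store-to-load forwarding.
def can_issue_load(load_color, store_address_buffer):
    # All stores up to and including the load's color must have done
    # AGEN (i.e. have a known address in the SAB) before the load issues
    return all(st["addr"] is not None
               for st in store_address_buffer[:load_color + 1])

def load_value(load_addr, load_color, sab, sdb, dcache):
    # Check for RAW against older stores, youngest first
    for i in range(load_color, -1, -1):
        if sab[i]["addr"] == load_addr:
            if i in sdb:              # store data ready: forward from SDB
                return sdb[i]
            return None               # std not done yet: stall the load
    return dcache.get(load_addr)      # no RAW: satisfy from the DCU
```

A load that hits a matching SAB entry whose std has executed gets its data forwarded from the SDB; if the std is still stalled, the load waits, exactly the first "what if" case in the notes.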

20 Instruction Completion
Handles all exception/interrupt/trap conditions.
Handles branch recovery: the OOO core drains out right-path instructions and commits them to the RRF; in parallel, the front end starts fetching from the target/fall-through. However, no renaming is allowed until the OOO core is drained; after draining is done, the RAT is reset to point to the RRF. This avoids checkpointing the RAT or recovering to an intermediate RAT state.
Commits execution results to the architectural state in order, into the retirement register file (RRF). Must handle hazards to the RRF (writes/reads in the same cycle) and hazards to the RAT (writes/reads in the same cycle).
"Atomic" IA-32 instruction completion: uops are marked as first or last in the sequence, and exceptions/interrupts/traps are taken only on instruction boundaries. 2-cycle retirement.
NOTE: the MIPS R10K paper on the reading list covers RAT (map table) checkpointing. You should understand this principle: each branch checkpoints the RAT state, so that on branch recovery execution can resume using those register mappings. The PPro avoids this by forcing the OOO core to drain and resetting the RAT to point to the RRF before resuming execution from the redirected path. This doesn't give the best performance (since redirected instructions may have to wait for the OOO core to drain), but it is much simpler. In the MIPS R10K, however, there is no RRF and no "default" mappings for the map table, so this can't be done.

21 Pentium Pro Design Methodology - 1

22 Pentium Pro Performance Analysis
Observability: on-chip event counters, dynamic analysis.
Benchmark suite: BAPCo SYSmark (Windows NT applications), Winstone (Windows NT applications), some SPEC95 benchmarks.
On-chip event counters: designers include counters that tally up "interesting" events like cache misses, branch mispredictions, etc. These counters can be accessed via special instructions. This allows performance characterization of applications and processors at full native processor speed (no simulation or external instrumentation needed). Very popular tool these days.
Anecdote: you always have to fight with designers to get these in, since they're not required for functional correctness. However, for performance analysis and performance debugging they are essential.

23 Performance – Run Times
How long each benchmark ran (apologies for the low quality of the x-axis labels). SPEC95 is on the right (3 benchmarks); all others are Windows applications like PageMaker, Photoshop, PowerPoint, etc.

24 Performance – IPC vs. uPC
About 30% higher IPC than Pentium (30% faster per clock cycle). The deeper pipeline resulted in a 30%+ higher clock frequency in the same process. The overall apples-to-apples performance improvement over Pentium in the same process technology was about 60% (!).
The uop expansion ratio is about 2:1. The 3-uop-wide machine only sustains <1 uop per cycle: hazards, cache misses, branch mispredictions (deep pipeline).

25 Performance – IPC vs. uPC
Histogram of the number of uops retired per cycle. The vast majority of cycles retire 0 uops (left) and 0 instructions (right); <20% of cycles retire one uop, <10% of cycles retire 2 uops, <20% retire 3 uops.
This does indicate "burstiness" of retirement: there are more 3-uop cycles than 2-uop cycles. Once the condition that prevents retirement (usually a cache miss or branch mispredict) is resolved, a burst of uops is retired.

26 Performance – Cache Misses
Cache misses per cycle: the rates look pretty good. However, normalizing to cache misses per instruction or per reference would increase the rates somewhat. Data misses appear much more prevalent than instruction misses. Also, L2 misses are rare, but their cost is much higher (6 cycles for an L1 miss, 50 or more cycles for an L2 miss).
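The normalization point is simple arithmetic: divide misses per cycle by IPC to get misses per instruction. The numbers below are made up purely to show the effect; the slides do not give the actual rates.

```python
# Converting misses/cycle to misses/instruction (hypothetical numbers).
misses_per_cycle = 0.008          # assumed L1 data miss rate per cycle
ipc = 0.8                         # assumed instructions per cycle
misses_per_instr = misses_per_cycle / ipc
mpki = misses_per_instr * 1000    # misses per kilo-instruction
# With IPC < 1, the per-instruction rate exceeds the per-cycle rate,
# which is why per-cycle numbers "look" better than they are.
```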

27 Performance – Branch Prediction
The direction prediction rate is pretty good. The BTB miss rate is terrible: since the BTB is tagged and the BHR/PHT are integrated into the BTB, how can the misprediction rate be so good? Answer: the performance counters count the static BTFN predictor's hits as correct predictions, which skews the result quite a bit.
This also shows the importance of the BAC logic in the front end: even if the BTB misses, the target address (and static prediction) are available only a few cycles later.

28 Performance (Frequency - IPC)
Conclusions
IA-32 compliant.
Performance (frequency × IPC):
SPECint92: 366.0
SPECfp92: 283.2
SPECint95: 8.09
SPECfp95: 6.70
Validation. Die size: fabable. Schedule: 1 year late. Power.
30% frequency bump over Pentium, 30% IPC bump over Pentium. The SPEC ratings were the fastest at the time (for a few months, then got leapfrogged by Alpha).
The same core lives on: Pentium II, Pentium III, Pentium III with integrated L2, then Pentium M with some changes, targeted for low power (notebooks). The Pentium M led to Core Duo (the dual-core version), then Core 2 Duo (Merom project), then Core i7 (Nehalem 1st gen, then Sandybridge with integrated GPU).
Pentium 4 was an all-new core: a 2x deeper pipeline, with just about all aspects of the uarch different. However, Pentium 4 had real problems with power consumption: super-high frequency and deep pipelines burn power, which is not suitable for laptop application. It was abandoned after the success of the Pentium M.

29 Retrospective
Most commercially successful microarchitecture in history.
Evolution: Pentium II/III, Xeon, etc.; derivatives with on-chip L2, ISA extensions, etc.
Replaced by Pentium 4 as flagship in 2001: high frequency, deep pipeline, extreme speculation.
Resurfaced as Pentium M in 2003: initially a response to Transmeta in the laptop market; the Pentium 4 derivative (90nm Prescott) was delayed, slow, and hot.
Core Duo, Core 2 Duo, and Core i7 replaced Pentium 4.

30 Microarchitectural Updates
Pentium M (Banias), Core Duo (Yonah)
Micro-op fusion (also in AMD K7/K8): multiple uops in one, e.g. add eax,[mem] => ld/alu, and sta/std. These uops decode/dispatch/commit once, but issue twice.
Better branch prediction: loop count predictor, indirect branch predictor.
Slightly deeper pipeline (12 stages): an extra decode stage for micro-op fusion, and an extra stage between issue and execute (for RS/PLRAM read).
Data-capture reservation station (payload RAM), clock-gated for 32-bit (int), 64-bit (fp), and 128-bit (SSE) operands.

31 Microarchitectural Updates
Core 2 Duo (Merom)
64-bit ISA from AMD K8.
Macro-op fusion: merge uops from two x86 ops, e.g. cmp + jne => cmpjne.
4-wide decoder (complex + 3x simple); peak x86 decode throughput is 5 due to macro-op fusion.
Loop buffer: loops that fit in the 18-entry instruction queue avoid fetch/decode overhead.
Even deeper pipeline (14 stages).
Larger reservation station (32) and instruction window (96).
Memory dependence prediction.

32 Microarchitectural Updates
Nehalem (Core i7/i5/i3)
RS size 36, ROB 128.
Loop cache up to 28 uops.
L2 branch predictor; L2 TLB.
I$ and D$ now 32K; L2 back to 256K; inclusive L3 up to 8M.
Simultaneous multithreading.
RAS now renamed (repaired).
6 issue; 48 load buffers, 32 store buffers.
New system interface (QPI): finally dropped the front-side bus.
Integrated memory controller (up to 3 channels).
New STTNI instructions for string/text handling.

33 Microarchitectural Updates
Sandybridge/Ivy Bridge (2nd-3rd generation Core i7)
On-chip integrated graphics (GPU).
Decoded uop cache, up to 1.5K uops; handles loops, but is more general.
54-entry RS, 168-entry ROB.
Physical register file: 144 FP, 160 integer.
256-bit AVX units: 8 DP FLOP/cycle, 16 SP FLOP/cycle.
2 general AGUs enable 2 loads/cycle, 2 stores/cycle, or any combination; 2x128-bit load path from the L1 D$.

34 Microarchitectural Updates
Haswell/Broadwell/Skylake: wider & deeper
8-wide issue (up from 6-wide): 4th integer ALU, third AGU, second branch unit.
60-entry RS, 192-entry ROB.
72-entry load queue, 42-entry store queue.
Physical register file: 168 FP, 168 integer.
Doubled FP throughput (32 SP/16 DP).
Load/store bandwidth to L1 doubled (64B/32B).
TSX (transactional memory).
Integrated voltage regulator.

