
1 ECE 4100/6100 Advanced Computer Architecture Lecture 12 P6 and NetBurst Microarchitecture
Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia Institute of Technology

2 P6 System Architecture
Host processor: P6 core with L2 cache (SRAM) on the back-side bus; front-side bus to the MCH.
- FSB: 8.5 GB/sec (1067 MHz x 8 bytes)
- System memory (DRAM) behind the MCH: latest DDR2-667 at 10.7 GB/sec
- Graphics processor (GPU) with local frame buffer: PCI Express x16 at 8 GB/sec (vs. AGP 8x at 2.1 GB/sec)
- ICH: PCI, USB, and other I/O
(MCH + ICH = chipset)

3 P6 Microarchitecture
- Instruction Fetch Cluster: Instruction Fetch Unit, BTB/BAC
- Issue Cluster: Instruction Decoder, Microcode Sequencer, Register Alias Table, Allocator
- Out-of-order Cluster: Reservation Station, ROB & Retire RF; execution units IEU/JEU, FEU, MMX, AGU, MIU
- Memory Cluster: Memory Order Buffer, Data Cache Unit (L1)
- Bus Cluster: Bus Interface Unit to the external bus (chip boundary)
Control flow is handled in order; execution follows (restricted) data flow.

4 Pentium III Die Map
- EBL/BBL - External/Backside Bus Logic
- MOB - Memory Order Buffer
- Packed FPU - Floating-Point Unit for SSE
- IEU - Integer Execution Unit
- FAU - Floating-Point Arithmetic Unit
- MIU - Memory Interface Unit
- DCU - Data Cache Unit (L1)
- PMH - Page Miss Handler
- DTLB - Data TLB
- BAC - Branch Address Calculator
- RAT - Register Alias Table
- SIMD - Packed Floating-Point Unit
- RS - Reservation Station
- BTB - Branch Target Buffer
- TAP - Test Access Port
- IFU - Instruction Fetch Unit and L1 I-Cache
- ID - Instruction Decode
- ROB - Reorder Buffer
- MS - Micro-instruction Sequencer

5 P6 Basics
One implementation of the IA32 architecture:
- Deeply pipelined processor
- In-order front-end and back-end
- Dynamic execution engine (restricted dataflow)
- Speculative execution
The P6 microarchitecture family includes:
- Pentium Pro
- Pentium II (PPro + MMX + 2x caches)
- Pentium III (P-II + SSE + enhanced MMX, e.g. PSAD)
- Pentium 4 (not P6; discussed separately)
- Pentium M (+ SSE2, SSE3, µop fusion)
- Core (PM + SSSE3, SSE4, Intel 64 (EM64T), macro-op fusion, retire rate of 4 µops vs. 3 in earlier generations)

6 P6 Pipelining
In-order front-end (stages 11-17): Next IP, I-Cache, ILD, Rotate, Dec1, Dec2, Br Dec; the decoded µops pass through the IDQ, RAT, and RS write stages (20-22) to cross the front-end in-order boundary.
Out-of-order core (stages 31-33, 81-83):
- Single-cycle pipeline: RS schedule, RS dispatch, Exec/WB (81: Mem/FP WB, 82: Int WB, 83: Data WB)
- Multi-cycle pipeline: Exec2 ... Exec n before writeback
- Non-blocking memory pipeline: AGU, DCache1, DCache2 (stages 42-43)
- Blocking memory pipeline: AGU, MOB blk, MOB wr (40-41), MOB wakeup, MOB dispatch, then DCache1, DCache2
Scheduling delays arise at the RS, ROB, and MOB.
In-order retirement (stages 91-93): Ret ptr wr, Ret ROB rd, RRF wr (the retirement in-order boundary).

7 Instruction Fetching Unit
Components: instruction cache with victim cache and streaming buffer, instruction TLB (linear address in, physical address out), instruction length decoder (ILD) with length marks, instruction rotator, next-PC mux fed by the Branch Target Buffer (prediction marks) and other fetch requests; the instruction buffer reports #bytes consumed by ID.
- IFU1: initiate fetch, requesting 16 bytes at a time
- IFU2: instruction length decoder marks instruction boundaries; BTB makes a prediction (2 cycles)
- IFU3: align instructions for the 3 decoders

8 Static Branch Prediction (stage 17, Br Dec, on slide 6)
On a BTB hit, the BTB dynamic predictor's decision is used. On a BTB miss, static prediction applies:
- Unconditional PC-relative branch: predict taken
- Return: predict taken
- Not PC-relative (indirect jump): predict taken
- Conditional with a backward target: predict taken (loop heuristic)
- Conditional with a forward target: predict not taken
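The static decision tree above can be sketched as a small function; the branch descriptor format here is illustrative, not Intel's:

```python
def static_predict(branch):
    """Static prediction on a BTB miss (sketch of the slide's decision tree).
    branch: dict with 'type' in {'uncond_pcrel', 'return', 'indirect',
    'conditional'}; conditionals also carry 'backward': bool."""
    if branch["type"] in ("uncond_pcrel", "return", "indirect"):
        return "taken"                 # unconditional, returns, indirect: taken
    if branch["type"] == "conditional":
        # backward conditional branches are assumed to be loop back-edges
        return "taken" if branch["backward"] else "not-taken"
    return "taken"
```

On a BTB hit the dynamic predictor's output would be used instead of this fallback.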

9 Dynamic Branch Prediction
- Similar to a 2-level PAs design: a 4-bit Branch History Register (BHR) associated with each entry of the 512-entry BTB indexes a 16-entry Pattern History Table (PHT, patterns 0000-1111) of 2-bit saturating counters, per way W0-W3
- 16-entry Return Stack Buffer
- 4 branch predictions per cycle (due to the 16-byte fetch per cycle)
- Speculative update (2 copies of the BHR: speculative and retired history)
- Static prediction is provided by the Branch Address Calculator when the BTB misses (see prior slide)
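A minimal sketch of the per-branch two-level scheme described above, with one simplification: it keeps a single history copy rather than the speculative/retired pair the slide mentions:

```python
class TwoLevelPredictor:
    """Per-branch (PAs-style) predictor: a 4-bit branch history register
    indexes a 16-entry pattern history table of 2-bit saturating counters."""

    def __init__(self, history_bits=4):
        self.mask = (1 << history_bits) - 1
        self.bhr = 0                            # branch history register
        self.pht = [2] * (1 << history_bits)    # counters init weakly taken

    def predict(self):
        return self.pht[self.bhr] >= 2          # counter 2 or 3 -> taken

    def update(self, taken):
        ctr = self.pht[self.bhr]
        self.pht[self.bhr] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        # shift the outcome into the history
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask
```

Training the same predictor instance on a consistently taken (or not-taken) branch drives the visited counters to saturation.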

10 X86 Instruction Decode
The 4-1-1 decoder examines the next 3 instructions in the 16-byte instruction buffer: one complex decoder (1-4 µops) and two simple decoders (1 µop each). Decode rate depends on instruction alignment (S: simple, C: complex):

Next 3 inst | #Inst decoded
S,S,S | 3
S,S,C | first 2
S,C,S | first 1
S,C,C | first 1
C,S,S | 3
C,S,C | first 2
C,C,S | first 1
C,C,C | first 1

- DEC1: translate x86 instructions into micro-operations (µops)
- DEC2: move decoded µops to the instruction decoder queue (6 µops), then to RAT/ALLOC
- The micro-instruction sequencer (MS) performs translations either by generating the entire µop sequence from the microcode ROM, or by receiving 4 µops from the complex decoder with the rest from the microcode ROM
- Instructions following an instruction that needs the MS are flushed
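The alignment table above follows from one rule: only the first decoder accepts a complex instruction. A sketch that reproduces the table:

```python
def decode_group(window):
    """How many of the next (up to 3) x86 instructions the 4-1-1 decoders
    accept in one cycle. 'S' = simple (any decoder), 'C' = complex
    (decoder 0 only, so a later C must wait for the next cycle)."""
    n = 0
    for pos, kind in enumerate(window[:3]):
        if kind == "C" and pos > 0:
            break          # complex inst not at decoder 0: stop this cycle
        n += 1
    return n
```

For example, `C,S,S` decodes all 3 (the complex instruction sits at decoder 0), while `S,C,S` decodes only the first.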

11 Register Alias Table (RAT)
Renaming Example Integer RAT Array Logical Src RRF PSrc EAX 25 EBX 2 Array Physical Src (Psrc) In-order queue Int and FP Overrides ECX 1 ECX EDX 15 RAT PSrc’s FP TOS Adjust FP RAT Array Allocator Physical ROB Pointers RRF ROB Register renaming for 8 integer registers, 8 floating point (stack) registers and flags: 3 op per cycle 40 80-bit physical registers embedded in the ROB (thereby, 6 bit to specify PSrc) RAT looks up physical ROB locations for renamed sources based on RRF bit Override logic is for dependent ops decoded at the same cycle Misprediction will revert all pointers to point to Retirement Register File (RRF)
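The RAT lookup and update can be sketched as follows; registers are numbered 0-7 and the `RRF` string markers are illustrative. Updating the table sequentially within a group plays the role of the override logic for same-cycle dependent µops:

```python
def rename(uops, num_arch_regs=8):
    """Rename (dst, src1, src2) architectural-register uops to ROB slots.
    rat[r] holds the ROB index of the latest in-flight producer of r, or
    None when the committed value lives in the retirement register file."""
    rat = [None] * num_arch_regs
    renamed = []
    for rob_slot, (dst, src1, src2) in enumerate(uops):
        s1 = rat[src1] if rat[src1] is not None else f"RRF{src1}"
        s2 = rat[src2] if rat[src2] is not None else f"RRF{src2}"
        rat[dst] = rob_slot          # later uops now read dst from this slot
        renamed.append((rob_slot, s1, s2))
    return renamed
```

The second µop below reads register 0 from ROB slot 0 rather than the RRF, because the first µop is its in-flight producer.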

12 Partial Stalls due to RAT
EAX AX write read MOV AX, m8 ; ADD EAX, m32 ; stall Partial register stalls CMP EAX, EBX INC ECX JBE XX ; stall Partial flag stalls (1) JBE reads both ZF and CF while INC affects (ZF,OF,SF,AF,PF) i.e. only ZF LAHF loads low byte of EFLAGS while TEST writes partial of them TEST EBX, EBX LAHF ; stall Partial flag stalls (2) XOR EAX, EAX MOV AL, m8 ; ADD EAX, m32 ; no stall SUB EAX, EAX Idiom Fix (1) Idiom Fix (2) Partial register stalls: Occurs when writing a smaller (e.g. 8/16-bit) register followed by a larger (e.g. 32-bit) read Because need to read different partial pieces from multiple physical registers ! Partial flags stalls: Occurs when a subsequent instruction reads more flags than a prior unretired instruction touches
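The partial-flag condition reduces to a set comparison. A sketch with simplified flag sets (the constant names are illustrative):

```python
def partial_flag_stall(writer_flags, reader_flags):
    """A later instruction stalls if it reads any flag that the most recent
    unretired flag-writing instruction did not produce."""
    return not reader_flags <= writer_flags    # reader not a subset -> stall

INC_FLAGS = {"ZF", "OF", "SF", "AF", "PF"}     # INC leaves CF unchanged
CMP_FLAGS = {"ZF", "OF", "SF", "AF", "PF", "CF"}
JBE_READS = {"ZF", "CF"}                       # JBE tests ZF and CF
```

This reproduces the slide's examples: JBE after INC stalls (CF missing), while JBE after CMP does not.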

13 Partial Register Width Renaming
INT Low Bank (32b/16b/L): 8 entries INT High Bank (H): entries Size(2) RRF(1) PSrc(6) Integer RAT Array Logical Src Array Physical Src In-order queue Int and FP Overries RAT Physical Src FP TOS Adjust FP RAT Array op0: MOV AL = (a) op1: MOV AH = (b) op2: ADD AL = (c) op3: ADD AH = (d) Allocator Physical ROB Pointers from Allocator 32/16-bit accesses: Read from low bank (AL/BL/CL/DL;AX/BX/CX/DX;EAX/EBX/ECX/EDX/EDI/ESI/EBP/ESP) Write to both banks (AH/BH/CH/DH) 8-bit RAT accesses: depending on which bank is being written and only update the particular bank

14 Allocator (ALLOC)
The interface between the in-order and out-of-order pipelines; allocates into the ROB, MOB, and RS.
- "3-or-none" µops per cycle into the ROB and RS: must have 3 free ROB entries, or no allocation
- "All-or-none" policy for the MOB: allocation stalls when not all the valid MOB µops can be allocated
- Generates a physical destination token (Pdst) from the ROB and passes it to the Register Alias Table (RAT) and RS
- Stalls upon shortage of resources
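A sketch of the allocation check implied by the two policies above; the µop kind names and parameters are illustrative:

```python
def can_allocate(group, rob_free, rs_free, mob_free):
    """Decide whether a decode group (up to 3 uops) allocates this cycle.
    '3-or-none' into ROB/RS: require 3 free entries regardless of group
    size. 'All-or-none' for the MOB: every memory uop must fit."""
    mem_uops = sum(1 for u in group if u in ("ld", "sta", "std"))
    if rob_free < 3 or rs_free < 3:     # 3-or-none policy
        return False
    return mob_free >= mem_uops         # all-or-none for memory uops
```

So even a single-µop group stalls when fewer than 3 ROB entries are free, and one unallocatable memory µop stalls the whole group.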

15 Reservation Stations (RS)
WB bus 0 Port 0 IEU0 Fadd Fmul Imul Div Pfmul WB bus 1 Port 1 IEU1 JEU Pfadd Pfshuf Loaded data Port 2 RS AGU0 Ld addr MOB LDA DCU STA Port 3 St addr AGU1 STD St data Port 4 ROB Retired data RRF Gateway to execution: binding max 5 op to each port per cycle Port binding at dispatch time (certain op can only be bound to one port) 20 op entry buffer bridging the In-order and Out-of-order engine (32 entries in Core) RS fields include op opcode, data valid bits, Pdst, Psrc, source data, BrPred, etc. Oldest first FIFO scheduling when multiple ops are ready at the same cycle
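The oldest-first, one-µop-per-port selection can be sketched as below; the entry fields are illustrative, not the actual RS layout:

```python
def schedule(rs_entries, ports_free):
    """Oldest-first scheduling: walk ready RS entries in age order and
    dispatch each to its bound port, at most one uop per port per cycle."""
    dispatched = []
    for entry in sorted(rs_entries, key=lambda e: e["age"]):
        if entry["ready"] and entry["port"] in ports_free:
            dispatched.append(entry["uop"])
            ports_free.remove(entry["port"])   # port consumed this cycle
    return dispatched
```

With two ready µops bound to the same port, only the older one dispatches; the other waits a cycle.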

16 ReOrder Buffer (ROB)
A 40-entry circular buffer (96 entries in Core), 157 bits wide; provides the 40 aliased physical registers.
- Out-of-order completion; each entry records any exception
- Retirement (de-allocation) happens after prior speculation is resolved; exceptions are handled through the MS ((exp) code assist)
- The OOO state is cleared when a mispredicted branch or exception is detected
- Retires 3 µops per cycle, in program order
- Multi-µop x86 instructions retire atomically: none or all
(Dataflow: ALLOC and RAT feed the RS and ROB; retired results drain to the RRF.)

17 Memory Execution Cluster
RS / ROB LD STA STD movl ecx, edi addl ecx, 8 movl -4(edi), ebx movl eax, 4(ecx) Load Buffer DTLB DCU LD STA RS cannot detect this and could dispatch them at the same time FB Store Buffer EBL Memory Cluster Manage data memory accesses Address Translation Detect violation of access ordering Fill buffers (FB) in DCU, similar to MSHR for non-blocking cache support

18 Memory Order Buffer (MOB)
Allocated by ALLOC A second order RS for memory operations 1 op for load; 2 op’s for store: Store Address (STA) and Store Data (STD) MOB 16-entry load buffer (LB) (32-entry in Core, 64 in SandyBridge) 12-entry store address buffer (SAB) (20-entry in Core, 36 in SandyBridge) SAB works in unison with Store data buffer (SDB) in MIU Physical Address Buffer (PAB) in DCU Store Buffer (SB): SAB + SDB + PAB Senior Stores Upon STD/STA retired from ROB SB marks the store “senior” Senior stores are committed back in program order to memory when bus idle or SB full Prefetch instructions in P-III Senior load behavior Due to no explicit architectural destination New Memory dependency predictor in Core to predict store-to-load dependencies

19 Store Coloring

x86 instruction | µops | store color
mov (0x1220), ebx | std ebx; sta 0x1220 | 1
mov (0x1110), eax | std eax; sta 0x1110 | 2
mov ecx, (0x1220) | ld 0x1220 | 2
mov edx, (0x1280) | ld 0x1280 | 2
mov (0x1400), edx | std edx; sta 0x1400 | 3
mov edx, (0x1380) | ld 0x1380 | 3

- ALLOC assigns a Store Buffer ID (SBID) to each store in program order
- ALLOC tags each load with the most recent SBID
- Loads are checked against stores with equal or older SBIDs for potential address conflicts
- The SDB forwards data if a conflict is detected
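The coloring and conflict check above can be sketched as follows, assuming SBIDs start at 1 as in the table; the tuple encoding is illustrative:

```python
def color_uops(instrs):
    """Tag stores with sequential SBIDs; tag each load with the most recent
    SBID (its 'color'), so only equal-or-older stores need checking."""
    sbid, tagged = 0, []
    for kind, addr in instrs:
        if kind == "st":
            sbid += 1
            tagged.append(("sta", addr, sbid))
        else:
            tagged.append(("ld", addr, sbid))   # load's color = last store SBID
    return tagged

def may_forward(tagged, load_idx):
    """True if an older store (SBID <= load's color) to the same address
    exists, i.e. the SDB could forward its data to the load."""
    _, addr, color = tagged[load_idx]
    return any(k == "sta" and a == addr and s <= color
               for k, a, s in tagged[:load_idx])
```

Running the slide's sequence, the load from 0x1220 (color 2) matches the SBID-1 store and can be forwarded; the load from 0x1280 matches nothing.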

20 Memory Type Range Registers (MTRR)
Control registers written by the system (OS). Supported memory types:
- UnCacheable (UC)
- Uncacheable Speculative Write-Combining (USWC or WC): uses a fill buffer entry as the WC buffer
- WriteBack (WB)
- Write-Through (WT)
- Write-Protected (WP): e.g. supports copy-on-write in UNIX, saving memory by letting child processes share pages with their parents; new pages are created only when a child process attempts to write
Page Miss Handler (PMH):
- Looks up the MTRRs while supplying physical addresses
- Returns the memory type and physical address to the DTLB
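Conceptually, the PMH's MTRR lookup maps a physical address to a memory type. A simplified sketch using base/size ranges (real MTRRs use base/mask register pairs, and the range values here are invented):

```python
def memory_type(mtrrs, phys_addr, default="UC"):
    """Classify a physical address by the first matching MTRR range.
    mtrrs: list of (base, size, type) tuples set up by the OS."""
    for base, size, mtype in mtrrs:
        if base <= phys_addr < base + size:
            return mtype
    return default        # unmapped addresses treated as uncacheable
```

The returned type travels with the translation into the DTLB, so later accesses to the page already know whether they may be cached or write-combined.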

21 Intel NetBurst Microarchitecture
The Pentium 4's microarchitecture. Original target market: graphics workstations, but...
Design goals: performance, performance, performance, ...
- Unprecedented multimedia/floating-point performance: Streaming SIMD Extensions 2 (SSE2); SSE3 introduced in the Prescott Pentium 4 (90nm)
- Reduced CPI: low-latency instructions, high-bandwidth instruction fetching, rapid execution of arithmetic and logic operations
- Reduced clock period: a new pipeline designed for scalability

22 Innovations Beyond P6
- Hyperpipelined technology
- Streaming SIMD Extension 2
- Hyper-Threading Technology (HT)
- Execution trace cache
- Rapid execution engine
- Staggered adder unit
- Enhanced branch predictor
- Indirect branch predictor (also in the Banias Pentium M)
- Load speculation and replay

23 Pentium 4 Fact Sheet
- IA-32, fully backward compatible
- Available at speeds ranging from 1.3 to ~3.8 GHz
- Hyperpipelined (20+ stages)
- 125 million transistors in Prescott (1.328 billion in the 16MB on-die L3 Tulsa, 65nm)
- 0.18µm for 1.3 to 2 GHz; 0.13µm for 1.8 to 3.4 GHz; 90nm for 2.8 to 3.6 GHz
- Die size of 122mm² (Prescott, 90nm); 435mm² (Tulsa, 65nm)
- Consumes 115 watts of power at 3.6 GHz
- 1066 MHz system bus
- Prescott L1: 16KB 8-way, vs. the previous P4's 8KB 4-way
- 1MB, 512KB, or 256KB 8-way full-speed on-die L2 (bandwidth example: 89.6 GB/s to L1)
- 2MB L3 cache (in the P4 HT Extreme Edition, 0.13µm only); 16MB in Tulsa
- 144 new 128-bit SIMD instructions (SSE2), and 13 new SSE3 instructions in Prescott
- Hyper-Threading Technology (not in all versions)

24 Building Blocks of NetBurst
- Front-end: Fetch/Decode, Execution Trace Cache (ETC), µROM, BTB/branch prediction (with branch history updates from retirement)
- Out-of-order engine: OOO logic, Retire
- Execution: INT and FP execution units, L1 data cache
- Memory subsystem: L2 cache, bus unit, system bus

25 Pentium 4 Microarchitecture (Prescott)
Front-end: BIU on the quad-pumped 800 MHz, 6.4 GB/sec 64-bit system bus; I-TLB/prefetcher (64-bit fetch), IA32 decoder (simple and complex instruction paths), code ROM, front-end BTB (4K entries); Execution Trace Cache (12K µops) with its own BTB (2K entries), feeding the µop queue.
OOO core: allocator/register renamer; memory µop queue and INT/FP µop queue; schedulers (memory, fast, slow/general, simple FP); INT register file/bypass network and FP register file/bypass network.
Execution: 2x-pumped ALUs (x2), slow ALU, AGUs for load and store addresses; FP move; FP/MMX/SSE/2/3.
Memory: L1 data cache (16KB, 8-way, 64-byte lines, write-through, 1 read + 1 write port); 256-bit path to the unified L2 (1MB, 8-way, 128-byte lines, write-back, 108 GB/s).

26 Pipeline Depth Evolution
PREF DEC EXEC WB P5 Microarchitecture IFU1 IFU2 IFU3 DEC1 DEC2 RAT ROB DIS EX RET1 RET2 P6 Microarchitecture 20 stages TC NextIP TC Fetch Drive Alloc Queue Rename Schedule Dispatch Reg File Exec Flags Br Ck NetBurst Microarchitecture (Willamette) NetBurst Microarchitecture (Prescott) > 30 stages

27 Execution Trace Cache
The primary first-level I-cache, replacing a conventional L1:
- Decoding several x86 instructions at high frequency is difficult and takes several pipeline stages
- The branch misprediction penalty is considerable
Advantages:
- Caches post-decode µops (think of a fill unit)
- High-bandwidth instruction fetching
- Eliminates x86 decoding overheads
- Reduces branch recovery time on a TC hit
Holds up to 12,000 µops, 6 µops per trace line, with many (?) trace lines in a single trace.

28 Execution Trace Cache
- Delivers 3 µops per cycle to the OOO engine when branch prediction is good
- x86 instructions are read from L2 when the TC misses (7+ cycle latency)
- TC hit rate is comparable to an 8KB-16KB conventional I-cache
- Simplified x86 decoder: only one complex instruction per cycle; instructions of more than 4 µops are executed from the microcode ROM (the P6's MS)
- Branch prediction in the TC: 512-entry BTB + 16-entry RAS; together with the BP in the x86 IFU, mispredictions are reduced by 33% compared to P6
- Intel did not disclose details of the BP algorithms used in the TC and the x86 IFU (dynamic + static)

29 Out-Of-Order Engine
Similar design philosophy to the P6; uses:
- Allocator
- Register Alias Table
- 128 physical registers
- 126-entry ReOrder Buffer
- 48-entry load buffer
- 24-entry store buffer

30 Register Renaming Schemes
ROB (40-entry) NetBurst Register Renaming Status Allocated sequentially . . . Data EBX ECX EDX ESI EDI EAX ESP EBP Front-end RAT RF (128-entry) ROB (126) EBX ECX EDX ESI EDI EAX ESP EBP RAT EBX ECX EDX ESI EDI EAX ESP EBP Retirement RAT Allocated sequentially Data Status RRF P6 Register Renaming

31 Micro-op Scheduling
µop FIFO queues:
- A memory queue for loads and stores
- A non-memory queue
µop schedulers:
- Several schedulers fire instructions from the 2 µop queues to execution (the P6's RS)
- 4 distinct dispatch ports
- Maximum dispatch: 6 µops per cycle (2 each from the fast ALUs on ports 0 and 1 per cycle; 1 each from the load/store ports)
Port bindings:
- Exec port 0: fast ALU, 2x pumped (add/sub, logic, store data, branches); FP move (FP/SSE move, FP/SSE store, FXCH)
- Exec port 1: fast ALU, 2x pumped (add/sub); INT exec (shift, rotate); FP exec (FP/SSE add, FP/SSE mul, FP/SSE div, MMX)
- Load port: loads, LEA, prefetch
- Store port: stores

32 Data Memory Accesses
- Prescott: 16KB 8-way L1 + 1MB 8-way L2 (with a HW prefetcher), 128-byte lines
- Load-to-use speculation: dependent instructions are dispatched before the load finishes. Due to the high frequency and deep pipeline, the trip from the load scheduler to execution is longer than the execution itself, so the scheduler assumes loads always hit L1. On an L1 miss, dependent instructions that have already left the scheduler temporarily receive incorrect data (mis-speculation).
- Replay logic: re-executes the load when mis-speculated. Mis-speculated operations are placed into a replay queue for re-dispatch; all trailing independent instructions are allowed to proceed ("tornado breaker").
- Up to 4 outstanding load misses (= the 4 fill buffers in the original P6)
- Store-to-load forwarding buffer: 24 entries; forwarding requires the same starting physical address and load data size <= store data size
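The timing consequence of load-hit speculation can be sketched as below; the latency numbers are illustrative, not Intel's:

```python
def dependent_ready(load_issue, l1_hit, l1_lat=4, l2_lat=18):
    """Load-hit speculation: the dependent uop is scheduled as if the load
    hits L1. On a miss it executes with stale data, is caught by the
    replay logic, and re-dispatches when the L2 fill returns."""
    speculative_cycle = load_issue + l1_lat
    if l1_hit:
        return speculative_cycle          # speculation paid off
    return load_issue + l2_lat            # replayed after the miss returns
```

The win is that the common case (L1 hit) never waits for a hit/miss confirmation; the cost is wasted execution slots on a miss.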

33 Fast Staggered ALU
For frequent ALU instructions (no multiply, shift, rotate, or branch processing). Double-pumped clocks; each operation finishes in 3 fast cycles:
1. Lower-order 16 bits (bit[15:0]), and bypass
2. Higher-order 16 bits (bit[31:16]), and bypass
3. ALU flags generation
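The staggered sequence above can be modeled bit-for-bit: low half first, carry into the high half, flags last. A sketch (the flag set shown is reduced to CF and ZF for brevity):

```python
def staggered_add(a, b):
    """32-bit add computed as two 16-bit halves plus a flags step, mirroring
    the low-half / high-half / flags stagger of the double-pumped ALU."""
    lo = (a & 0xFFFF) + (b & 0xFFFF)                 # fast cycle 1: bits 15:0
    carry = lo >> 16
    hi = ((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF) + carry  # cycle 2: bits 31:16
    result = ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)
    flags = {"CF": bool(hi >> 16), "ZF": result == 0}         # cycle 3: flags
    return result, flags
```

Because a dependent add only needs the low 16 bits one fast cycle later, back-to-back dependent adds can proceed at the fast-clock rate.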

34 Branch Predictor
The P4 uses the same hybrid predictor as the Pentium M: a bimodal predictor, a local predictor, and a global predictor. Selection MUXes choose among the bimodal prediction (Pred_B), the local prediction (Pred_L, selected on L_hit), and the global prediction (Pred_G, selected on G_hit).

35 Indirect Branch Predictor
In the Pentium M and the Prescott Pentium 4; prediction is based on global history.

36 New Instructions over Pentium
- CMOVcc / FCMOVcc r, r/m: conditional (predicated) move instructions, based on condition codes (cc)
- FCOMI/P: compare FP stack entries and set integer flags
- RDPMC/RDTSC instructions: performance-monitoring counters (P6 has 2; NetBurst (P4) has 18)
- Uncacheable Speculative Write-Combining (USWC): a weakly ordered memory type for graphics memory

37 New Instructions
- SSE2 in the Pentium 4 (not in the P6 microarchitecture): double-precision SIMD FP
- SSSE3 in Core 2: supplemental instructions for shuffle, align, add, subtract
- Intel 64 (EM64T): 64-bit support, new registers (8 more on top of the existing 8); in Celeron D and Core 2 (and P4 Prescott, Pentium D); almost compatible with AMD64
- AMD's NX bit / Intel's XD bit: prevents buffer-overflow attacks

38 Streaming SIMD Extension 2
P-III SSE (Katmai New Instructions: KNI):
- Eight 128-bit wide xmm registers (new architectural state)
- Single-precision 128-bit SIMD FP: four 32-bit FP operations in one instruction, broken into 2 µops for execution (only 80-bit data paths in the ROB)
- 64-bit SIMD MMX (uses the 8 mm registers, which map onto the FP stack)
- Prefetch (nta, t0, t1, t2) and sfence
P4 SSE2 (Willamette New Instructions: WNI):
- Double-precision 128-bit SIMD FP: two 64-bit FP operations in one instruction
- Throughput: 2 cycles for most SSE2 operations (exceptions: DIVPD and SQRTPD, 69 cycles, non-pipelined)
- Enhanced 128-bit SIMD MMX using the xmm registers

39 Examples of Using SSE
Lanes are written X3..X0 for xmm1 and Y3..Y0 for xmm2:
- Packed SP FP operation (e.g. ADDPS xmm1, xmm2): xmm1 = [X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0]
- Scalar SP FP operation (e.g. ADDSS xmm1, xmm2): xmm1 = [X3, X2, X1, X0 op Y0]; only lane 0 computes
- Shuffle FP operation with an 8-bit immediate (e.g. SHUFPS xmm1, xmm2, 0xf1): the low two result lanes select from X3..X0 and the high two from Y3..Y0
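The three operation shapes above can be modeled lane by lane; lists stand in for xmm registers with index 0 as the least significant lane:

```python
def addps(x, y):
    """Packed single: four lane-wise operations in one instruction."""
    return [a + b for a, b in zip(x, y)]

def addss(x, y):
    """Scalar single: operate on lane 0 only; upper lanes pass through."""
    return [x[0] + y[0]] + x[1:]

def shufps(x, y, imm8):
    """SHUFPS: the low two result lanes select from x, the high two from y,
    each selector being a 2-bit field of the immediate."""
    sel = [(imm8 >> (2 * i)) & 3 for i in range(4)]
    return [x[sel[0]], x[sel[1]], y[sel[2]], y[sel[3]]]
```

With the slide's immediate 0xf1 (binary 11 11 00 01), the result is [X1, X0, Y3, Y3]. The SSE2 doubles (ADDPD/ADDSD/SHUFPD) follow the same shapes over two 64-bit lanes.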

40 Examples of Using SSE and SSE2
xmm1 X3 X2 X1 X0 xmm1 Shuffle FP operation (e.g. SHUFPS xmm1, xmm2, imm8) X3 X2 X1 X0 Y3 Y2 Y1 Y0 Y3 .. Y0 X3 .. X0 Shuffle FP operation (8-bit imm) (e.g. SHUFPS xmm1, xmm2, 0xf1) X3 X2 X1 X0 Y3 Y2 Y1 Y0 xmm1 xmm2 Y3 Y2 Y1 Y0 xmm2 Y3 Y2 Y1 Y0 xmm2 op X3 op Y3 X2 op Y2 X1 op Y1 X0 op Y0 op xmm1 X3 X2 X1 xmm1 X0 op Y0 Packed SP FP operation (e.g. ADDPS xmm1, xmm2) Scalar SP FP operation (e.g. ADDSS xmm1, xmm2) SSE2 X1 X0 xmm1 X1 X0 xmm1 X1 X0 Y1 Y0 xmm2 Y1 Y0 xmm2 Y1 Y0 op op op X1 op Y1 X0 op Y0 xmm1 X1 X0 op Y0 xmm1 Y1 or Y0 X1 or X0 Packed DP FP operation (e.g. ADDPD xmm1, xmm2) Scalar DP FP operation (e.g. ADDSD xmm1, xmm2) Shuffle DP operation (2-bit imm) (e.g. SHUFPD xmm1, xmm2, imm2) Shuffle FP operation (e.g. SHUFPS xmm1, xmm2, imm8)

41 HyperThreading
In the Intel Xeon processor and Intel Xeon MP processor.
- Enables Simultaneous Multi-Threading (SMT): exploits TLP (thread-level parallelism) alongside ILP, issuing and executing multiple threads in the same snapshot
- A single P4 with HT appears as 2 logical processors
- The logical processors share the same execution resources; the dTLB is shared, tagged with a logical-processor ID
- Some other shared resources are partitioned (next slide)
- Architectural state and some microarchitectural state is duplicated: IPs, iTLB, streaming buffer, architectural register file, return stack buffer, branch history buffer, Register Alias Table

42 Multithreading (MT) Paradigms
Plotting functional units FU1-FU4 against cycles, with threads 1-5 versus unused execution time:
- Conventional superscalar, single-threaded
- Fine-grained multithreading (cycle-by-cycle interleaving)
- Coarse-grained multithreading (block interleaving)
- Chip multiprocessor (CMP), today called multi-core processors
- Simultaneous multithreading (Intel's HT)

43 HyperThreading Resource Partitioning
- TC (and µROM): accessed in alternate cycles by each logical processor, unless one is stalled on a TC miss
- µop queue (after fetch from the TC): split in half
- ROB (126/2), LB (48/2), SB (24/2) (32/2 for Prescott)
- General µop queue and memory µop queue: half each
- TLB (half?), as there is no PID
- Retirement: alternates between the 2 logical processors
