1 CENG 450 Computer Systems and Architecture Lecture 13 Amirali Baniasadi firstname.lastname@example.org
2 This Lecture
- Superscalar Hardware
- P6 & P4 Microarchitectures
3 Instruction Buffers
[Figure: superscalar block diagram. Labels: Inst. cache; Pre-decode; Inst. buffer; Decode, rename, dispatch; Floating point inst. buffer; Integer/address inst. buffer; Functional units; Functional units and data cache; Memory interface; Integer register file; Floating point register file; Reorder and commit]
4 Issue Buffer Organization
- a) Single, shared queue
  - No out-of-order issue
  - No renaming
- b) Multiple queues, one per instruction type
  - No out-of-order issue inside a queue
  - Queues issue out of order with respect to one another
5 Issue Buffer Organization
- c) Multiple reservation stations (one per instruction type, or one big pool)
- No FIFO ordering
- Execution starts as soon as the operands are ready and the hardware is available
- Proposed by Tomasulo
[Figure label: From Instruction Dispatch]
6 Typical reservation station
Fields: Operation | source 1 | data 1 | valid 1 | source 2 | data 2 | valid 2 | destination
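The entry layout above can be sketched in Python. This is only an illustration under stated assumptions: the class name, the `capture` and `ready` methods, and the tag strings such as "p7" are hypothetical and do not come from the lecture; only the field names follow the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    """One entry with the fields listed on the slide (all names illustrative)."""
    operation: str
    source1: str                  # tag of the producer of operand 1
    source2: str                  # tag of the producer of operand 2
    destination: str
    data1: Optional[int] = None
    valid1: bool = False
    data2: Optional[int] = None
    valid2: bool = False

    def capture(self, tag: str, value: int) -> None:
        """Latch a broadcast result into any operand still waiting on `tag`."""
        if not self.valid1 and self.source1 == tag:
            self.data1, self.valid1 = value, True
        if not self.valid2 and self.source2 == tag:
            self.data2, self.valid2 = value, True

    def ready(self, unit_free: bool) -> bool:
        """Execution may start once both operands are valid and a unit is free."""
        return self.valid1 and self.valid2 and unit_free


rs = ReservationStation("add", source1="p3", source2="p7", destination="p9",
                        data1=5, valid1=True)
before = rs.ready(unit_free=True)   # operand 2 is still pending
rs.capture("p7", 10)                # the producer of p7 broadcasts its result
after = rs.ready(unit_free=True)
```

Here `before` is False and `after` is True: the entry waits with no FIFO constraint until its second operand arrives, which matches the "ready operands, hardware available" rule on the previous slide.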
7 Memory Hazard Detection Logic
[Figure: memory hazard detection logic. Labels: Instruction issue; loads; stores; Address add & translation; Load address buffer; Store address buffer; Address compare; Hazard control; To memory]
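The address-compare step in the figure can be sketched as follows. This is a simplified assumption, not the lecture's design: the function name `load_may_issue` is hypothetical, whole addresses are compared (access sizes are ignored), and a store with an unknown address is conservatively treated as a conflict.

```python
from typing import List, Optional


def load_may_issue(load_addr: int,
                   pending_store_addrs: List[Optional[int]]) -> bool:
    """Return True if a load can go to memory ahead of the older stores.

    `pending_store_addrs` holds the addresses of older stores that have not
    reached memory yet; None marks a store whose address is not computed.
    """
    for store_addr in pending_store_addrs:
        if store_addr is None:        # unknown address: assume a conflict
            return False
        if store_addr == load_addr:   # match: RAW hazard through memory
            return False
    return True


no_conflict = load_may_issue(0x100, [0x200, 0x300])
conflict = load_may_issue(0x100, [0x200, 0x100])
unknown = load_may_issue(0x100, [None])
```

With these inputs `no_conflict` is True, while `conflict` and `unknown` are both False, so the load is held back by the hazard control.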
8 Summary
- Dynamic ILP
  - Instruction buffer
    - Split ID into two stages: one for in-order issue and the other for out-of-order issue
  - Scoreboard
    - Out-of-order execution, but does not eliminate WAR/WAW hazards (it stalls on them)
  - Tomasulo's algorithm
    - Uses register renaming to eliminate WAR/WAW hazards
  - Dynamic scheduling + precise state + speculation
  - Superscalar
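How renaming removes WAR/WAW hazards can be illustrated with a small Python sketch. It assumes an unlimited supply of physical registers that are never freed, and the names `rename`, `r0`..`r3`, and `p0`.. are hypothetical; it is not Tomasulo's full algorithm, only the mapping step.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, str, str]  # (destination, source 1, source 2)


def rename(program: List[Instr], num_arch_regs: int = 4) -> List[Instr]:
    """Give every write a fresh physical register; reads use the latest mapping."""
    mapping: Dict[str, str] = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    next_phys = num_arch_regs
    renamed = []
    for dest, src1, src2 in program:
        s1, s2 = mapping[src1], mapping[src2]  # read sources before remapping dest
        mapping[dest] = f"p{next_phys}"        # fresh register for this write
        next_phys += 1
        renamed.append((mapping[dest], s1, s2))
    return renamed


program = [
    ("r1", "r2", "r3"),  # writes r1
    ("r2", "r1", "r3"),  # RAW on r1; WAR on r2 with the first instruction
    ("r1", "r3", "r3"),  # WAW on r1 with the first; WAR on r1 with the second
]
renamed = rename(program)
# renamed == [("p4", "p2", "p3"), ("p5", "p4", "p3"), ("p6", "p3", "p3")]
```

After renaming, every destination is unique, so only the true (RAW) dependence from the first to the second instruction through `p4` remains.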
9 The P6 Microarchitecture
- P6: introduced in 1995
- Basis for Pentium Pro, Pentium II and Pentium III
- Differences: instruction set extensions (MMX added to Pentium II, SSE added to Pentium III)
- 3 instructions fetched/decoded every cycle
- Instructions are translated to uops
- Uops: RISC-like instructions
- Register renaming and a ROB are used
- Pipeline is 14 stages:
  - 8 stages to fetch/decode/dispatch in order
  - 3 stages to execute out of order
  - 3 stages to commit
10 The P6 Microarchitecture
- Functional units: integer unit, FP unit, branch unit, memory address unit
- Register renaming uses 40 physical registers, 20 reservation stations and a 40-entry ROB
- Voltage: 2.9 V; power: 14 W
- Dual-cavity package, 0.6 micron process
11 The P6 Microarchitecture
- Compared to Pentium (P5):
  - Pipeline stages: 14 vs. 5
  - 3-way vs. 2-way superscalar
- Fundamental goal: solve the memory latency problem
- The MOB (Memory Ordering Buffer) makes sure that:
  - Stores are never reordered and never speculated
  - Loads can pass loads and stores
  - Forwarding and bypassing happen
12 Dynamic Scheduling in P6
- Q: How to pipeline 1- to 17-byte 80x86 instructions?
- P6 doesn't pipeline 80x86 instructions
- The P6 decode unit translates the Intel instructions into 72-bit micro-operations (similar to MIPS instructions)
- It sends micro-operations to the reorder buffer & reservation stations
- Many instructions translate to 1 to 4 micro-operations
- Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations
13 Dynamic Scheduling in P6
Parameter                               Value
Max. instructions issued/clock          3 (80x86), 6 (micro-ops)
Max. instr. complete exec./clock        5
Max. instr. committed/clock             3
Window (instrs in reorder buffer)       40
Number of reservation stations          20
Number of rename registers              40
No. integer functional units (FUs)      2
No. floating point FUs                  1
No. SIMD Fl. Pt. FUs                    1
No. memory FUs                          1 load + 1 store
14 P6 Pipeline
- 8 stages are used for in-order instruction fetch, decode, and issue
  - Takes 1 clock cycle to determine the length of 80x86 instructions + 2 more to create the micro-operations (uops)
- 3 stages are used for out-of-order execution in one of 5 separate functional units
- 3 stages are used for instruction commit
[Figure: pipeline diagram. Instr Fetch (16B/clk) -> Instr Decode (3 instr/clk) -> Renaming (3 uops/clk) -> Execution units (5) -> Graduation (3 uops/clk). Other labels: 16B; 6 uops; Reserv. Station; Reorder Buffer]
16 Pentium III Die Photo
- EBL/BBL - Bus logic, Front, Back
- MOB - Memory Order Buffer
- Packed FPU - MMX Fl. Pt. (SSE)
- IEU - Integer Execution Unit
- FAU - Fl. Pt. Arithmetic Unit
- MIU - Memory Interface Unit
- DCU - Data Cache Unit
- PMH - Page Miss Handler
- DTLB - Data TLB
- BAC - Branch Address Calculator
- RAT - Register Alias Table
- SIMD - Packed Fl. Pt.
- RS - Reservation Station
- BTB - Branch Target Buffer
- IFU - Instruction Fetch Unit (+I$)
- ID - Instruction Decode
- ROB - Reorder Buffer
- MS - Micro-instruction Sequencer
1st Pentium III: 9.5M transistors, 12.3 x 10.4 mm in a 0.25-micron process with 5 layers of aluminum
22 P6 vs. AMD Athlon
- Athlon is similar to the P6 microarchitecture (Pentium III), but has more resources
- Transistors: PIII 24M vs. Athlon 37M
- Die size: 106 mm² vs. 117 mm²
- Power: 30 W vs. 76 W
- Cache: 16K/16K/256K vs. 64K/64K/256K
- Window size: 40 vs. 72 uops
- Rename registers: 40 vs. 36 int + 36 Fl. Pt.
- BTB: 512 x 2 vs. 4096 x 2
- Pipeline: 10-12 stages vs. 9-11 stages
- Clock rate: 1.0 GHz vs. 1.2 GHz
- Memory bandwidth: 1.06 GB/s vs. 2.12 GB/s
23 Pentium 4
- Known as the NetBurst architecture
- Still translates from 80x86 to micro-ops
- P4 has a better branch predictor and more FUs
- Instruction cache holds micro-operations instead of 80x86 instructions
  - No 80x86 decode stages on a cache hit
  - Called the “trace cache” (TC)
- Faster memory bus: 400 MHz vs. 133 MHz
- Caches
  - Pentium III: L1I 16 KB, L1D 16 KB, L2 256 KB
  - Pentium 4: L1I 12K uops, L1D 8 KB, L2 256 KB
  - Block size: PIII 32B vs. P4 128B; 128 vs. 256 bits/clock
24 Pentium 4 features
- Clock rates:
  - Pentium III 1 GHz vs. Pentium 4 1.5 GHz
  - 14-stage pipeline vs. 24-stage pipeline
  - 42 million transistors
- ALUs operate at 2X the clock rate for many ops
- Rename registers: 40 vs. 128; window: 40 vs. 126
- BTB: 512 vs. 4096 entries (Intel: 1/3 improvement)
- Can retire 3 uops per cycle
- Branch predictor removes 1/3 of the mispredicted branches compared to P6
25 Pentium, Pentium Pro, P4 Pipeline
- Pentium (P5) = 5 stages
- Pentium Pro, II, III (P6) = 10 stages (1 cycle ex)
- Pentium 4 (NetBurst) = 20 stages (no decode)
From “Pentium 4 (Partially) Previewed,” Microprocessor Report, 8/28/00
26 Block Diagram of Pentium 4 Microarchitecture
- BTB = Branch Target Buffer (branch predictor)
- I-TLB = Instruction TLB; Trace Cache = instruction cache (delivers uops)
- RF = Register File; AGU = Address Generation Unit
- "Double pumped ALU" means the ALU runs at 2X the clock rate, which is equivalent to 2X the ALU F.U.s
From “Pentium 4 (Partially) Previewed,” Microprocessor Report, 8/28/00
27 Block Diagram of Pentium 4 Microarchitecture
- Micro-op queues: one for memory operations, one for non-memory operations
- Register renaming: the ROB is NOT used for register renaming
- Dispatch bandwidth (6) exceeds front-end and retirement bandwidth (3)
- ALU operations are done twice as fast as the clock
  - Key: ALU bypass loop
28 Pentium 4 Microarchitecture
- Longest latencies (clock cycles): multiply 14, divide 60
- Low-latency, small 8 KB L1 cache; medium-latency, large 256 KB L2 cache
- Store-to-load forwarding: pending loads take their data from pending stores to the same address before those stores have been written to memory
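Store-to-load forwarding can be sketched in Python as below. This is a minimal model under assumptions that are not in the lecture: the function name `load` is hypothetical, pending stores are kept as an ordered list of (address, value) pairs, whole addresses are compared, and memory is a plain dictionary.

```python
from typing import Dict, List, Tuple


def load(addr: int,
         pending_stores: List[Tuple[int, int]],
         memory: Dict[int, int]) -> int:
    """Return the value a load sees, given older stores not yet in memory.

    `pending_stores` is in program order (oldest first) as (address, value).
    """
    for store_addr, value in reversed(pending_stores):  # youngest match wins
        if store_addr == addr:
            return value             # forwarded from the pending store
    return memory.get(addr, 0)       # no match: read the cache/memory


memory = {0x40: 1, 0xC0: 5}
pending = [(0x40, 7), (0x80, 3), (0x40, 9)]
forwarded = load(0x40, pending, memory)    # 9, from the youngest store to 0x40
from_memory = load(0xC0, pending, memory)  # 5, no pending store matches
```

Searching from the youngest store backwards is the design choice that keeps the load consistent with program order when several pending stores hit the same address.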
29 Pentium 4 Die Photo
- 42M transistors
  - PIII: 26M
- 217 mm²
  - PIII: 106 mm²
- L1 Execution Cache
  - Buffers 12,000 micro-ops
- 8 KB data cache
- 256 KB L2$
30 Benchmarks: Pentium 4 v. PIII v. Athlon
- SPECbase2000
  - Int: P4 @ 1.5 GHz: 524, PIII @ 1 GHz: 454, AMD Athlon @ 1.2 GHz: ?
  - FP: P4 @ 1.5 GHz: 549, PIII @ 1 GHz: 329, AMD Athlon @ 1.2 GHz: 304
- WorldBench 2000 benchmark (business), PC World magazine, Nov. 20, 2000 (bigger is better)
  - P4: 164, PIII: 167, AMD Athlon: 180
- Quake 3 Arena: P4 172, Athlon 151
- SYSmark 2000 composite: P4 209, Athlon 221
- Office productivity: P4 197, Athlon 209
- S.F. Chronicle 11/20/00: "… the challenge for AMD now will be to argue that frequency is not the most important thing-- precisely the position Intel has argued while its Pentium III lagged behind the Athlon in clock speed."
31 Why?
- Instruction count is the same for x86
- Clock rates: P4 > Athlon > PIII
- How can the P4 be slower?
- Time = Instruction count x CPI x 1/Clock rate
- On the benchmarks where the P4 is slower, its average Clocks Per Instruction (CPI) must be worse than that of the Athlon and PIII
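The argument above can be checked with a short Python sketch using the figures from the benchmark slide. It assumes that a benchmark score is proportional to 1/Time and that both machines execute the same instruction count; the function name `relative_cpi` is hypothetical.

```python
def relative_cpi(clock_a_ghz: float, score_a: float,
                 clock_b_ghz: float, score_b: float) -> float:
    """CPI of machine A divided by CPI of machine B for equal instruction counts.

    From Time = Instruction count x CPI x 1/Clock rate and score ~ 1/Time,
    CPI is proportional to clock rate / score.
    """
    return (clock_a_ghz / score_a) / (clock_b_ghz / score_b)


# P4 at 1.5 GHz against PIII at 1 GHz, scores from the benchmark slide.
specint = relative_cpi(1.5, 524, 1.0, 454)      # about 1.30
worldbench = relative_cpi(1.5, 164, 1.0, 167)   # about 1.53
```

Under these assumptions the P4 needs roughly 1.3 times as many clocks per instruction as the PIII on SPECint, and about 1.5 times as many on WorldBench 2000, where its 50% clock advantage is not enough to keep it ahead.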
32 Readings & Homework
- Readings
  - Download papers from the website: P6 and P4.