3. Instruction Buffers
[Figure: block diagram with the labels Inst. cache, Pre-decode, Inst. buffer, Decode/rename/dispatch, Floating point inst. buffer, Integer/address inst. buffer, Floating point register file, Integer register file, Functional units, Functional units and data cache, Memory interface, Reorder and commit]

4. Issue Buffer Organization
a) Single, shared queue
- No out-of-order issue
- No renaming
b) Multiple queues; one per inst. type
- No out-of-order issue inside a queue
- Queues issue out of order with respect to each other

5. Issue Buffer Organization
c) Multiple reservation stations (one per instruction type, or one big pool)
- No FIFO ordering
- Execution starts as soon as the operands are ready and the hardware is available
- Proposed by Tomasulo
[Figure: reservation stations fed from instruction dispatch]

6. Typical reservation station
Operation | source 1 | data 1 | valid 1 | source 2 | data 2 | valid 2 | destination

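The reservation station fields above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name, the `capture` and `select_ready` helpers, and the use of string tags are hypothetical and are not taken from the slides. It shows the two behaviors the slides describe: an entry waits until both operands are valid, and selection ignores FIFO order.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStationEntry:
    """One entry with the fields from the slide (field names are illustrative)."""
    operation: str
    source1: Optional[str]   # tag of the producer of operand 1, if still pending
    data1: Optional[int]
    valid1: bool
    source2: Optional[str]   # tag of the producer of operand 2, if still pending
    data2: Optional[int]
    valid2: bool
    destination: str         # tag under which this entry's result is broadcast

    def ready(self) -> bool:
        # An entry may start executing only when both operands are valid.
        return self.valid1 and self.valid2

    def capture(self, tag: str, value: int) -> None:
        """Snoop a broadcast result and fill any operand still waiting on `tag`."""
        if not self.valid1 and self.source1 == tag:
            self.data1, self.valid1 = value, True
        if not self.valid2 and self.source2 == tag:
            self.data2, self.valid2 = value, True


def select_ready(entries):
    """No FIFO ordering: return any entry whose operands are ready, else None."""
    for entry in entries:
        if entry.ready():
            return entry
    return None
```

With an older entry still waiting on tag `t1` and a younger entry that already has both operands, `select_ready` returns the younger one; once `capture("t1", value)` is called on the older entry, it becomes eligible as well.
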
8. Summary
- Dynamic ILP
- Instruction buffer: split ID into two stages, one for in-order issue and the other for out-of-order issue
- Scoreboard: out-of-order execution, but it does not eliminate WAR/WAW hazards (it stalls on them)
- Tomasulo's algorithm: uses register renaming to eliminate WAR/WAW hazards
- Dynamic scheduling + precise state + speculation
- Superscalar

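How register renaming removes WAR/WAW hazards can be illustrated with a short sketch. It assumes an unbounded supply of physical registers and a three-operand instruction format `(dest, src1, src2)`; the function name `rename` and the `rN`/`pN` naming are hypothetical, chosen only for this example.

```python
from itertools import count


def rename(instructions, num_arch_regs=4):
    """Rename (dest, src1, src2) triples so every write gets a fresh physical register."""
    # Assumption: architectural register rN initially lives in physical register pN.
    mapping = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    fresh = count(num_arch_regs)  # assumption: physical registers never run out
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read the current mapping, so true (RAW) dependences are preserved.
        s1, s2 = mapping[src1], mapping[src2]
        # A new physical register per write removes WAR and WAW name conflicts.
        mapping[dest] = f"p{next(fresh)}"
        renamed.append((mapping[dest], s1, s2))
    return renamed


program = [
    ("r1", "r2", "r3"),  # r1 = r2 op r3
    ("r2", "r1", "r3"),  # WAR on r2 with the first instruction, RAW on r1
    ("r1", "r0", "r3"),  # WAW on r1 with the first, WAR on r1 with the second
]
for before, after in zip(program, rename(program)):
    print(before, "->", after)
```

After renaming, the three instructions write three different physical registers, so only the true dependence of the second instruction on the first remains.
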
9. The P6 Microarchitecture
- P6: introduced in 1995
- Basis for Pentium Pro, Pentium II and Pentium III
- Differences: instruction set extensions (MMX added to Pentium II, SSE added to Pentium III)
- 3 instructions fetched/decoded every cycle
- Instructions are translated to uops (micro-operations), which are RISC-like instructions
- Register renaming and a ROB are used
- Pipeline is 14 stages:
  - 8 stages to fetch/decode/dispatch in order
  - 3 stages to execute out of order
  - 3 stages to commit

10. The P6 Microarchitecture
- Functional units: integer unit, FP unit, branch unit, memory address unit
- Register renaming uses 40 physical registers, 20 reservation stations and a 40-entry ROB
- Voltage 2.9 V, power 14 W
- Dual-cavity package, 0.6 micron process

11. The P6 Microarchitecture
- Compared to Pentium (P5):
  - Pipeline stages: 14 vs. 5
  - 3-way vs. 2-way
- Fundamental goal: solve the memory latency problem
- MOB (Memory Ordering Buffer) makes sure that:
  - Stores are never reordered and never speculated
  - Loads can pass loads and stores
  - Forwarding and bypassing happen

12. Dynamic Scheduling in P6
- Q: How to pipeline 1- to 17-byte 80x86 instructions?
- P6 doesn't pipeline 80x86 instructions
- P6 decode unit translates the Intel instructions into 72-bit micro-operations (similar to MIPS instructions)
- Sends micro-operations to the reorder buffer & reservation stations
- Many instructions translate to 1 to 4 micro-operations
- Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations

13. Dynamic Scheduling in P6
Parameter                              80x86   microops
Max. instructions issued/clock         3       6
Max. instr. complete exec./clock               5
Max. instr. committed/clock                    3
Window (instrs in reorder buffer)              40
Number of reservation stations
Number of rename registers
No. integer functional units (FUs)
No. floating point FUs
No. SIMD Fl. Pt. FUs
No. memory FUs                                 load + 1 store

14. P6 Pipeline
- 8 stages are used for in-order instruction fetch, decode, and issue
  - Takes 1 clock cycle to determine the length of 80x86 instructions + 2 more to create the micro-operations (uops)
- 3 stages are used for out-of-order execution in one of 5 separate functional units
- 3 stages are used for instruction commit
[Figure: pipeline diagram. Instr fetch (16B/clk), instr decode (3 instr/clk), renaming (3 uops/clk), reservation station, execution units (5), reorder buffer, graduation (3 uops/clk); queue labels: 16B, 6 uops]

16. Pentium III Die Photo
[Figure: die photo with the following block labels]
- EBL/BBL - Bus logic, Front, Back
- MOB - Memory Order Buffer
- Packed FPU - MMX Fl. Pt. (SSE)
- IEU - Integer Execution Unit
- FAU - Fl. Pt. Arithmetic Unit
- MIU - Memory Interface Unit
- DCU - Data Cache Unit
- PMH - Page Miss Handler
- DTLB - Data TLB
- BAC - Branch Address Calculator
- RAT - Register Alias Table
- SIMD - Packed Fl. Pt.
- RS - Reservation Station
- BTB - Branch Target Buffer
- IFU - Instruction Fetch Unit (+I$)
- ID - Instruction Decode
- ROB - Reorder Buffer
- MS - Micro-instruction Sequencer

Statistics
- 0.25 micron 5-layer metal CMOS process technology
- 9.5M transistors
- 10.2 x 12.1 mm die size (excluding the etch ring)
- 3-way superscalar out-of-order execution micro-architecture
- 70 new streaming SIMD instructions:
  - Comprehensive set of new SIMD-FP instructions
  - Additional SIMD-integer MMX Technology instructions
  - New memory streaming instructions (for FP & integer data types)

Bottom left quadrant
Logic for the front-end of the pipeline resides here.
- IFU: Instruction Fetch Unit. Instruction fetch logic and a 16K Byte 4-way set-associative level one instruction cache reside in this block. Instruction data from the IFU is then forwarded to the ID.
- BTB: Branch Target Buffer. This block is responsible for dynamic branch prediction based on the history of past branch decision paths.
- BAC: Branch Address Calculator. Static branch prediction is performed here to handle the BTB miss case.
- TAP: Testability Access Port. Various testability and debug mechanisms reside within this block.

Bottom right quadrant
Instruction decode, scheduling, dispatch, and retirement functionality is contained within this quadrant.
- ID: Instruction Decoder. This unit is capable of decoding up to 3 instructions per cycle.
- MS: Micro-instruction Sequencer. This holds the microcode ROM and sequencer for more complex instruction flows. The microcode update functionality is also located here.
- RS: Reservation Station. Micro-instructions and source data are held here for scheduling and dispatch to the execution ports. Dispatch can happen out-of-order and is dependent on source data availability and an available execution port.
- ROB: Re-Order Buffer. This supports a 40-entry physical register file that holds temporary write-back results that can complete out of order. These results are then committed to a separate architectural register file during in-order retirement.

Top right quadrant
This primarily consists of the execution datapath for the Pentium® III processor.
- SIMD: SIMD integer execution unit for MMX Technology instructions.
- MIU: Memory Interface Unit. This is responsible for data conversion and formatting for floating point data types.
- IEU: Integer Execution Unit. This is responsible for ALU functionality of scalar integer instructions. Address calculations for memory referencing instructions are also performed here, along with target address calculations for jump related instructions.
- FAU: Floating point Arithmetic Unit. This performs floating point related calculations for existing scalar instructions, along with support for some of the new SIMD-FP instructions.
- PFAU: Packed Floating point Arithmetic Unit. This contains arithmetic execution data-path functionality for SIMD-FP specific instructions.

Top left quadrant
Functionality in this quadrant is split into assorted functions including bus interface related functionality, data cache access, and allocation.
- ALLOC: Allocator. Allocation of various resources such as ROB, MOB, and RS entries is performed here prior to micro-instruction dispatch by the RS.
- RAT: Register Alias Table. During resource allocation the renaming of logical to physical registers is performed here.
- MOB: Memory Order Buffer. Acts as a separate schedule and dispatch engine for data loads and stores. Also temporarily holds the state of outstanding loads and stores from dispatch until completion.
- DTLB: Data Translation Look-aside Buffer. Performs the translation from linear addresses to physical addresses required for support of virtual memory.
- PMH: Page Miss Handler. Hardware engine for performing a page table walk in the event of a TLB miss.
- DCU: Data Cache Unit. Contains the non-blocking 16K Byte 4-way set-associative level one data cache along with associated fill and write back buffering.
- BBL: Back-side Bus Logic. Logic for interface to the back-side bus for accesses to the external unified level two processor cache.
- EBL: External Bus Logic. Logic for interface to the external front-side bus.
- PIC: Programmable Interrupt Controller. Local interrupt controller logic for multi-processor interrupt distribution and boot-up communication.

1st Pentium III: 9.5M transistors, 12.3 x 10.4 mm in 0.25 micron with 5 layers of aluminum

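The ROB behavior described for the die photo (results complete out of order, then commit to a separate architectural register file in order) can be sketched as follows. The 40-entry size and the retirement width of 3 per cycle come from the slides; the class name, method names, and dictionary-based entries are assumptions made for illustration, not Intel's design.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: out-of-order write-back, in-order retirement."""

    def __init__(self, size=40, retire_width=3):
        self.size = size
        self.retire_width = retire_width
        self.entries = deque()   # oldest entry on the left
        self.arch_regs = {}      # separate architectural register file

    def allocate(self, dest):
        """Reserve an entry in program order; return None if the ROB is full."""
        if len(self.entries) >= self.size:
            return None
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def write_back(self, entry, value):
        """Record a result; this may happen in any order."""
        entry["value"], entry["done"] = value, True

    def retire(self):
        """Commit up to retire_width completed entries, oldest first."""
        retired = []
        while (self.entries and self.entries[0]["done"]
               and len(retired) < self.retire_width):
            entry = self.entries.popleft()
            self.arch_regs[entry["dest"]] = entry["value"]
            retired.append(entry["dest"])
        return retired
```

If the youngest of three entries finishes first, `retire()` returns an empty list until the oldest entry has also written back, which is what keeps the architectural state precise.
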
22. P6 vs. AMD Athlon
- Athlon is similar to the P6 microarchitecture (Pentium III), but with more resources
- Transistors: PIII 24M v. Athlon 37M
- Die size: 106 mm^2 v. 117 mm^2
- Power: 30W v. 76W
- Cache: 16K/16K/256K v. 64K/64K/256K
- Window size: 40 v. 72 uops
- Rename registers: 40 v. 36 int + 36 Fl. Pt.
- BTB: 512 x 2 v x 2
- Pipeline: stages v stages
- Clock rate: 1.0 GHz v. 1.2 GHz
- Memory bandwidth: 1.06 GB/s v GB/s

23. Pentium 4
- Known as NetBurst architecture
- Still translates from 80x86 to micro-ops
- P4 has a better branch predictor and more FUs
- Instruction cache holds micro-operations vs. 80x86 instructions
  - No decode stages of 80x86 on a cache hit
  - Called "trace cache" (TC)
- Faster memory bus: 400 MHz v. 133 MHz
- Caches
  - Pentium III: L1I 16 KB, L1D 16 KB, L2 256 KB
  - Pentium 4: L1I 12K uops, L1D 8 KB, L2 256 KB
  - Block size: PIII 32B v. P4 128B; 128 v. 256 bits/clock

24. Pentium 4 features
- Clock rates: Pentium III 1 GHz v. Pentium 4 1.5 GHz
- 14-stage pipeline v. 24-stage pipeline
- 42 million transistors
- ALUs operate at 2X clock rate for many ops
- Rename registers: 40 v. 128; window: 40 v. 126
- BTB: 512 vs entries (Intel: 1/3 improvement)
- Can retire 3 uops per cycle
- Branch predictor removes 1/3 of mispredicted branches compared to P6

27. Block Diagram of Pentium 4 Microarchitecture
- Micro-op queues: one for memory operations, one for non-memory operations
- Register renaming: the ROB is NOT used for register renaming
- Dispatch bandwidth (6) exceeds front-end and retirement bandwidth (3)
- ALU operations are done twice as fast as the clock. Key: ALU bypass loop

28. Pentium 4 Microarchitecture
- Longest latencies: multiply 14, divide 60
- Low-latency small 8 KB L1 cache, medium-latency large 256 KB L2 cache
- Store-to-load forwarding: pending loads take their data from pending stores before the stores have been written to the cache

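Store-to-load forwarding, as described above, can be sketched in a few lines. This is a simplified model under loud assumptions: all store addresses are already known, every access has the same size and alignment, and the function name `load` and the list-of-tuples store buffer are hypothetical. Real hardware also has to handle partial overlaps and unresolved addresses.

```python
def load(address, pending_stores, memory):
    """Return the value a load sees.

    pending_stores: list of (address, value) pairs, ordered oldest to youngest,
    holding stores that have executed but not yet been written to memory.
    """
    # Search from the youngest pending store so the most recent write wins.
    for store_address, value in reversed(pending_stores):
        if store_address == address:
            return value  # forwarded from the store buffer, memory not accessed
    # No matching pending store: read memory (assumption: unwritten cells read as 0).
    return memory.get(address, 0)


memory = {0x10: 1}
pending = [(0x10, 5), (0x20, 7), (0x10, 9)]
print(load(0x10, pending, memory))  # value comes from the youngest store to 0x10
print(load(0x30, pending, memory))  # no pending store, falls through to memory
```

The point of the design is that the load does not wait for the stores to drain: it takes the youngest matching value directly from the buffer, and memory itself is left untouched until the stores commit.
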
30. Benchmarks: Pentium 4 v. PIII v. Athlon
- SPECbase2000
  - Int, GHz: 524, 454, AMD
  - FP, GHz: 549, 329, AMD
- WorldBench 2000 benchmark (business), PC World magazine, Nov. 20, 2000 (bigger is better)
  - P4: 164, PIII: 167, AMD Athlon: 180
- Quake 3 Arena: P4 172, Athlon 151
- SYSmark 2000 composite: P4 209, Athlon 221
- Office productivity: P4 197, Athlon 209
- S.F. Chronicle 11/20/00: "… the challenge for AMD now will be to argue that frequency is not the most important thing-- precisely the position Intel has argued while its Pentium III lagged behind the Athlon in clock speed."

31. Why?
- Instruction count is the same for x86
- Clock rates: P4 > Athlon > PIII
- How can P4 be slower?
- Time = Instruction count x CPI x 1/Clock rate
- Average Clocks Per Instruction (CPI) of P4 must be worse than Athlon, PIII

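The CPI argument can be made concrete with the WorldBench 2000 scores quoted earlier (P4 164, Athlon 180). The sketch below rests on three assumptions: the score is inversely proportional to execution time, both machines execute the same instruction count, and the benchmarked parts ran at the 1.5 GHz (P4) and 1.2 GHz (Athlon) clock rates given in the comparison slides. The function name `relative_cpi` is hypothetical.

```python
def relative_cpi(clock_a_ghz, score_a, clock_b_ghz, score_b):
    """CPI of machine A divided by CPI of machine B, for equal instruction counts."""
    # Time = IC x CPI / clock  =>  CPI = Time x clock / IC.
    # Assumption: Time is proportional to 1 / score (bigger score is better).
    return (clock_a_ghz / score_a) / (clock_b_ghz / score_b)


ratio = relative_cpi(1.5, 164, 1.2, 180)
print(f"P4 CPI / Athlon CPI = {ratio:.2f}")  # prints 1.37
```

Under these assumptions the P4 needs about 37% more clock cycles per instruction than the Athlon on this benchmark, which more than cancels its 25% clock rate advantage.
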
32. Readings & Homework
- Readings: download papers from the website: P6 and P4.