Slide 1: Advanced CISC Implementations: Pentium 4
Krste Asanovic, Laboratory for Computer Science, Massachusetts Institute of Technology
6.823, Spring 2002 (Asanovic/Devadas)
Slide 2: Intel Pentium Pro (1995)
- During decode, translate complex x86 instructions into RISC-like micro-operations (uops)
  - e.g., "R <- R op Mem" translates into:
      load T, Mem    # Load from Mem into temp reg
      R <- R op T    # Operate using value in temp
- Execute uops using a speculative out-of-order superscalar engine with register renaming
- The Pentium Pro architecture (P6 family) was also used in the Pentium II and Pentium III processors
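The load-plus-operate split above can be sketched as a tiny "cracking" routine. This is a toy illustration, not Intel's actual decoder: the instruction tuples, the temp-register name `T0`, and the uop encodings are all invented for the example.

```python
# Hypothetical sketch of P6-style uop "cracking": a reg-mem x86-like
# instruction is split into a load uop plus a register-register uop.
# Encodings here are invented for illustration only.

def crack(x86_inst):
    """Translate one x86-like instruction into RISC-like uops."""
    op, dst, src = x86_inst              # e.g. ("add", "eax", "[ebx]")
    if src.startswith("["):              # "R <- R op Mem" form
        return [("load", "T0", src),     # load T0, Mem
                (op, dst, dst, "T0")]    # dst <- dst op T0
    return [(op, dst, dst, src)]         # already register-register

uops = crack(("add", "eax", "[ebx]"))
print(uops)  # [('load', 'T0', '[ebx]'), ('add', 'eax', 'eax', 'T0')]
```

A register-register instruction passes through as a single uop, which matches the idea that only the complex memory forms need cracking.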
Slide 3: Intel Pentium 4 (2000)
- Deeper pipelines than the P6 family
  - about half as many levels of logic per pipeline stage as P6
- Trace cache holds decoded uops
  - only a single x86-to-uop decoder
- Decreased latency in the same process technology
  - aggressive circuit design
  - new microarchitectural tricks

This lecture contains figures and data taken from: "The microarchitecture of the Pentium 4 processor", Intel Technology Journal, Q1, 2001.
Slide 4: Pentium 4 Block Diagram
[Block diagram: System Bus -> Bus Unit; Memory Subsystem: Level 1 Data Cache, Level 2 Cache; Front End: Fetch/Decode, Trace Cache, Microcode ROM, BTB/Branch Prediction; Out-of-order Engine: out-of-order execution logic, Retirement, Branch History Update; Execution Units: integer and FP execution units]
Slide 5: P-III vs. P-4 Pipelines
Basic Pentium III processor misprediction pipeline (10 stages):
  Fetch | Fetch | Decode | Decode | Decode | Rename | ROB Rd | Rdy/Sch | Dispatch | Exec
Basic Pentium 4 processor misprediction pipeline (20 stages):
  TC Nxt IP (x2) | TC Fetch (x2) | Drive | Alloc | Rename (x2) | Que | Sch (x3) | Disp (x2) | RF (x2) | Ex | Flgs | Br Ck | Drive
In the same process technology, ~1.5x clock frequency.
Performance equation:
  Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)
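The performance equation can be turned into a small calculation showing why a 1.5x clock does not automatically mean 1.5x performance: a deeper pipeline tends to raise cycles per instruction (CPI) through longer mispredict and miss penalties. The CPI and frequency values below are illustrative assumptions, not measured P-III/P-4 data.

```python
# Sketch of the performance equation ("iron law"):
#   time/program = (insts/program) * (cycles/inst) * (time/cycle)
# All numbers below are illustrative, not measured processor data.

def exec_time(insts, cpi, freq_hz):
    """Execution time in seconds for a program of `insts` instructions."""
    return insts * cpi / freq_hz

insts = 1e9
baseline = exec_time(insts, cpi=1.0, freq_hz=1.0e9)  # shallower pipeline
deeper   = exec_time(insts, cpi=1.2, freq_hz=1.5e9)  # ~1.5x clock, higher CPI
print(baseline, deeper)  # deeper pipe wins only if CPI doesn't grow too much
```

With these assumed numbers the deeper pipeline still comes out ahead (0.8 s vs 1.0 s), but a CPI increase beyond 1.5x would erase the clock advantage entirely.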
Slide 6: Apple Marketing for the G4 (Shorter Data Pipeline)
"The performance advantage of the PowerPC G4 starts with its data pipeline. The term 'processor pipeline' refers to the number of processing steps, or stages, it takes to accomplish a task. The fewer the steps, the shorter — and more efficient — the pipeline. Thanks to its efficient 7-stage design (versus 20 stages for the Pentium 4 processor) the G4 processor can accomplish a task with 13 fewer steps than the PC. You do the math."
Slide 7: Relative Frequency of Intel Designs
[Figure: relative clock frequency of Intel designs over time]
- Over time, fewer logic levels per pipeline stage and more advanced circuit design
- Higher frequency in the same process technology
Slide 8: Deep Pipeline Design
Greater potential throughput, but:
- Clock uncertainty and latch delays eat into the cycle-time budget
  - doubling pipeline depth gives less than twice the frequency improvement
- Clock load and power increase
  - more latches running at higher frequencies
- A more complicated microarchitecture is needed to cover long branch-mispredict and cache-miss penalties
  - from Little's Law, more instructions must be in flight to cover longer latencies => larger reorder buffers
- P-4 has three major clock domains
  - double-pumped ALUs (3 GHz): small critical area at the highest speed
  - main CPU pipeline (1.5 GHz)
  - trace cache (0.75 GHz): saves power
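The Little's Law point above can be made concrete: to sustain an issue bandwidth of W uops per cycle across an average latency of L cycles, roughly W x L instructions must be in flight, which sizes the reorder buffer. The widths and latencies below are illustrative assumptions, not actual P-4 parameters.

```python
# Little's Law applied to instructions in flight:
#   in_flight = issue_width * average_latency
# Figures are illustrative, not actual P-4 parameters.

def in_flight_needed(issue_width, avg_latency_cycles):
    """Instructions that must be in flight to keep the pipeline busy."""
    return issue_width * avg_latency_cycles

shallow = in_flight_needed(3, 10)  # shorter pipeline, shorter penalties
deep    = in_flight_needed(3, 40)  # deeper pipeline, longer miss penalty
print(shallow, deep)  # 30 120 -> the deeper pipe needs a much larger window
```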
Slide 9: Pentium 4 Trace Cache
- Holds decoded uops in predicted program-flow order, 6 uops per line
- Code in memory (basic blocks separated by taken branches):
    cmp; br T1; ... T1: sub; br T2; T2: mov; sub; br T3; T3: add; sub; mov; br T4; T4: ...
- Code packed in the trace cache (6 uops/line), with taken branches and their targets stored contiguously:
    line 1: cmp | br T1 | sub | br T2 | mov | sub
    line 2: br T3 | add | sub | mov | br T4 | T4: ...
- The trace cache fetches one 6-uop line every 2 CPU clock cycles (it runs at 1/2 the main CPU rate)
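The packing step can be sketched as chunking the predicted-order uop stream into fixed-size lines; because the stream is already in predicted flow order, a taken branch and its target naturally land in the same line. This is a toy model of the idea on the slide, not the real trace cache fill logic.

```python
# Toy sketch of trace cache line packing: the predicted-path uop
# stream (mirroring the slide's example) is chunked into 6-uop lines.
# Real trace cache fill logic is far more involved; this only shows
# why a taken branch and its target can share a line.

LINE_SIZE = 6

def pack_trace(predicted_path, line_size=LINE_SIZE):
    """Chunk a predicted-order uop stream into fixed-size lines."""
    return [predicted_path[i:i + line_size]
            for i in range(0, len(predicted_path), line_size)]

path = ["cmp", "br T1", "sub", "br T2", "mov", "sub",
        "br T3", "add", "sub", "mov", "br T4"]
for line in pack_trace(path):
    print(line)
```

Note that `br T1` and the `sub` at its target sit adjacent in line 1, even though they are far apart in memory.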
Slide 10: Trace Cache Advantages
- Removes x86 decode from the branch-mispredict penalty
  - the parallel x86 decoder took 2.5 cycles in P6, and would take 5 cycles in the P-4 design
- Allows higher fetch bandwidth for correctly predicted taken branches
  - P6 had a one-cycle bubble for correctly predicted taken branches
  - P-4 can fetch a branch and its target in the same cycle
- Saves energy
  - the x86 decoder is only powered up on a trace cache refill
Slide 11: Pentium 4 Front End
[Datapath: L2 Cache -> (x86 instructions, 8 bytes/cycle) -> Fetch Buffer -> (one x86 instruction/cycle) -> x86 Decoder -> (4 uops/cycle) -> Trace Cache Fill Buffer -> (6 uops/line) -> Trace Cache (12K uops); Front End BTB (4K entries) with instruction prefetch and TLB]
- Translation from x86 instructions to internal uops happens only on a trace cache miss, one x86 instruction per cycle
- Translations are cached in the trace cache
Slide 13: P-III vs. P-4 Renaming
[P-4 pipeline, stages 1-20: TC Next IP (BTB) x2, TC Fetch x2, Drive, Alloc, Rename x2, Queue, Schedule 1-3, Dispatch 1-2, Register File 1-2, Execute, Flags, Branch Check, Drive]
- The P-4 physical register file is separated from ROB status
- ROB entries are allocated sequentially, as in the P6 family
- One of 128 physical registers is allocated from a free list
- No data movement on retire; only the Retirement RAT is updated
Slide 14: P-4 µOp Queues and Schedulers
- Allocated/renamed uops enter the queues at 3 uops/cycle
- uop queues are in-order within each queue: a memory uop queue and an arithmetic uop queue
- The queues feed the schedulers: two Fast schedulers, a Memory scheduler, a General scheduler, and a Simple FP scheduler
- Ready uops compete for dispatch ports (the Fast schedulers can each dispatch 2 ALU operations per cycle)
Slide 15: P-4 Execution Ports
  Exec Port 0: ALU (double speed): add/sub, logic, store data, branches | FP move: FP/SSE move, FP/SSE store, FXCH
  Exec Port 1: ALU (double speed): add/sub | integer operation: shift/rotate | FP execute: FP/SSE add, FP/SSE mul, FP/SSE div, MMX
  Load Port: memory load: all loads, LEA, SW prefetch
  Store Port: memory store: store address
- Schedulers compete for access to the execution ports
- Loads and stores have dedicated ports
- The ALUs can execute two operations per cycle
- Peak bandwidth of 6 uops per cycle: load, store, plus four double-pumped ALU operations
Slide 16: P-4 Fast ALUs and Bypass Path
[Figure: fast ALUs, register file, and bypass network connected to the L1 data cache]
- The fast ALUs and bypass network run at double speed
- All "non-essential" circuit paths are handled out of the loop to reduce circuit loading (shifts, multiplies/divides, branches, flag ops)
- Other bypassing takes multiple clock cycles
Slide 17: P-4 Staggered ALU Design
- Staggers the 32-bit add and flag compare into three half-cycle phases:
  - low 16 bits
  - high 16 bits
  - flag checks
- 16 bits are bypassed around every half cycle
  - back-to-back dependent 32-bit adds run at 3 GHz in 0.18 µm
- L1 data cache access starts with the bottom 16 bits as the index; the top 16 bits are used for the tag check later
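The staggering can be sketched in software: compute the low 16 bits first, forward their carry, then compute the high 16 bits in the next half-cycle phase. A dependent add can start consuming the low half before the high half is ready. This is a functional model of the arithmetic split, not the circuit.

```python
# Sketch of the staggered 32-bit add: low 16 bits in one half-cycle
# phase, carry forwarded into the high 16 bits in the next phase.
# Functional model only; the flag-check phase is omitted.

def staggered_add(a, b):
    """32-bit add computed as two chained 16-bit adds (mod 2^32)."""
    lo = (a & 0xFFFF) + (b & 0xFFFF)                # phase 1: low 16 bits
    carry = lo >> 16
    hi = ((a >> 16) + (b >> 16) + carry) & 0xFFFF   # phase 2: high 16 bits
    return (hi << 16) | (lo & 0xFFFF)

print(hex(staggered_add(0x0000FFFF, 0x00000001)))  # 0x10000: carry crosses halves
```

The carry chain between the halves is exactly what the half-cycle bypass network forwards, which is why back-to-back dependent adds can proceed at the double-pumped rate.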
Slide 18: P-4 Load Schedule Speculation
[P-4 pipeline, stages 1-20, with Schedule in stages 10-12 and Load Execute in stages 17-18]
- There is a long delay from the schedulers to the load hit/miss signal
- P-4 guesses that a load will hit in the L1 cache and schedules dependent operations to use the value
- If the load misses, only the dependent operations are replayed
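The replay policy can be modeled as: issue everything assuming an L1 hit, then on a miss re-issue only the uops that consume the load's result. This toy model is invented for illustration; the real replay machinery tracks dependence chains through the scheduler.

```python
# Toy model of load-hit speculation: the scheduler issues dependents
# assuming an L1 hit; on a miss, only the dependents of that load are
# replayed. Structure and names are invented for illustration.

def schedule(uops, deps, l1_hit):
    """uops: issue-order names; deps: uop -> producing load (or absent)."""
    issued, replayed = [], []
    for u in uops:
        issued.append(u)                 # speculatively issued on schedule
        if deps.get(u) is not None and not l1_hit:
            replayed.append(u)           # consumed a load that missed
    return issued, replayed

uops = ["load1", "add_dep", "mul_indep"]
deps = {"add_dep": "load1"}              # add_dep consumes load1's result
print(schedule(uops, deps, l1_hit=False))
# (['load1', 'add_dep', 'mul_indep'], ['add_dep'])
```

Note that the independent `mul_indep` is never replayed, which is the point of the slide: a miss penalizes only the dependent chain, not everything in flight.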
Slide 21: Pentium-III Die Photo
[Die photo with labeled units: bus logic (external and backside), clock, programmable interrupt control, page miss handler, 16KB 4-way set-associative D-cache, 256KB 8-way set-associative L2 cache, instruction fetch unit (16KB 4-way set-associative I-cache), instruction decoders (3 x86 insts/cycle), microinstruction sequencer, branch address calc, register alias table, allocator (ROB, MOB, RS entries), reservation station, reorder buffer (40-entry physical regfile + architectural regfile), memory order buffer, memory interface unit (converts floats to/from memory format), integer datapaths, floating-point datapaths, packed FP datapaths, MMX datapaths]
Slide 22: Scaling of Wire Delay
- Over time, transistors are getting relatively faster than long wires
  - wire resistance is growing dramatically with shrinking width and height
  - capacitance is roughly fixed for a constant-length wire
  - RC delays of fixed-length wires are rising
- Chips are getting bigger
  - the P-4 is more than 2x the size of the P-III
- Clock frequency is rising faster than transistor speed
  - deeper pipelines, fewer logic gates per cycle
  - more advanced circuit designs (each gate goes faster)
=> It takes multiple cycles for a signal to cross the chip
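The wire-delay argument can be made quantitative with a lumped RC estimate: for a fixed-length wire, resistance per unit length rises as the cross-section shrinks while capacitance per unit length stays roughly constant, so the RC product grows. The resistance and capacitance values below are illustrative, not taken from any real process.

```python
# Sketch of fixed-length wire RC delay under scaling. Values are
# illustrative, not from any real process node.

def wire_rc_delay(r_per_mm, c_per_mm, length_mm):
    """Lumped RC estimate: delay ~ R_total * C_total.
    (A distributed-RC model adds a factor of ~0.5, omitted here.)"""
    return (r_per_mm * length_mm) * (c_per_mm * length_mm)

old = wire_rc_delay(r_per_mm=1.0, c_per_mm=0.2, length_mm=10)  # older node
new = wire_rc_delay(r_per_mm=4.0, c_per_mm=0.2, length_mm=10)  # thinner wire
print(old, new)  # same length and capacitance, ~4x the RC delay
```

With resistance up 4x and capacitance flat, the same 10 mm wire is 4x slower even before the chip grows, which is why the P-4 pipeline spends whole stages just driving signals.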
Slide 23: Visible Wire Delay in the P-4 Design
[P-4 pipeline, stages 1-20, with the two "Drive" stages (5 and 20) highlighted]
- Pipeline stages are dedicated to just driving signals across the chip!
Slide 24: Instruction Set Translation
Convert a target ISA into a host machine's ISA.
- Pentium Pro (P6 family): translation in hardware after instruction fetch
- Pentium 4 family: translation in hardware at level-1 instruction cache refill
- Transmeta Crusoe: translation in software using "Code Morphing"