Out-of-order execution -P6 Lihu Rappoport, 12/2004 1 MAMAS – Computer Architecture The P6 Micro-Architecture An Example of an Out-Of-Order Micro-processor.


MAMAS – Computer Architecture
The P6 Micro-Architecture: An Example of an Out-Of-Order Micro-processor
Dr. Lihu Rappoport

The P6 family
- Pentium® Pro (1995)
  - 150~200 MHz, 512K L2
- Pentium® II (1997)
  - 233~450 MHz, 512K L2
  - MMX™ Technology
- Pentium® III (1999)
  - Up to 1.4 GHz (0.13μ)
  - 512K L2 (on die, full speed)
  - MMX™ + SSE
- Pentium® III Xeon™
  - Up to 900 MHz (0.18μ)
  - Up to 2MB L2 cache
- Celeron® (1998)
  - Up to 1.4 GHz (0.13μ)
  - 256K L2 on die
  - Moved to a P4 based core (≥1.7 GHz)

P6 Features
- Dynamic Execution, a combination of:
  - Out-of-order execution: data flow analysis
  - Register renaming
  - Speculative execution
  - Multiple branch prediction
- Super pipeline: 12 pipe stages

P6 Arch
- In-Order Front End
  - BIU: Bus Interface Unit
  - IFU: Instruction Fetch Unit (includes IC)
  - BTB: Branch Target Buffer
  - ID: Instruction Decoder
  - MIS: Micro-Instruction Sequencer
  - RAT: Register Alias Table
- Out-of-order Core
  - ROB: Reorder Buffer
  - RRF: Real Register File
  - RS: Reservation Stations
  - IEU: Integer Execution Unit
  - FEU: Floating-point Execution Unit
  - AGU: Address Generation Unit
  - MIU: Memory Interface Unit
  - DCU: Data Cache Unit
  - MOB: Memory Order Buffer
  - L2: Level 2 cache
- In-Order Retire
[Figure: P6 block diagram connecting the units listed above and the external bus]

P6 Pipeline
[Figure: pipeline diagram with stages I1-I8 (In-Order Front End), O1, O3 (Out-of-order Core), and R1, R2 (In-order Retirement)]
1: Next IP
2: ICache lookup
3: ILD (instruction length decode)
4: Rotate
5: ID1
6: ID2
7: RAT: rename sources; ALLOC: assign destinations
8: ROB: read sources; RS: schedule data-ready uops for dispatch
9: RS: dispatch uops
10: EX
11: Retirement

In-Order Front End
- BTB: predicts the address of the next instruction to be fetched
- IFU: fetches 16 bytes per cycle from the instruction cache
  - from the L2 or memory in case of an IC miss
- ID: decodes instructions into uops
  - up to 6 uops/cycle
- MIS: produces uops for complex instructions
  - instructions which decode into >4 uops
- RAT: Register Alias Table

Branch Prediction
- Implementation
  - Use local history to predict direction
  - Need to predict multiple branches
    - Need to predict branches before previous branches are resolved
    - Branch history is updated first based on the prediction, later based on actual execution (speculative history)
  - Target address taken from the BTB
- Prediction rate: ~92%
  - ~60 instructions between mispredictions
  - A high prediction rate is crucial for long pipelines
  - Especially important for OOOE and speculative execution:
    - On a misprediction, all instructions following the branch in the instruction window are flushed
    - The effective size of the window is determined by prediction accuracy
- RSB (Return Stack Buffer) used for Call/Return pairs

Branch Prediction - Clustering
- A cluster is a 16-byte aligned memory block = fetch line
- The predictor needs to provide a prediction for the entire fetch line:
  - Predict the first taken branch in the line, following the fetch IP
- Implemented by
  - Splitting the IP into offset within line, set, and tag
  - If the tag of more than one way matches the fetch IP
    - The offsets of the matching ways are ordered
    - Ways with an offset smaller than the fetch IP offset are discarded
    - The first branch that is predicted taken is chosen as the predicted branch
[Figure: a fetch line with a jump into the line, branches predicted not taken and predicted taken, and a jump out of the line]
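The selection rule on this slide can be sketched in a few lines of Python. This is only an illustration: the `BtbWay` record, the `split_ip` helper, and the exact bit split of the IP are assumptions made for the example, not the actual P6 field layout.

```python
from dataclasses import dataclass
from typing import List, Optional

LINE_BYTES = 16   # fetch line size, from the slide
NUM_SETS = 128    # number of BTB sets, from the BTB slide


@dataclass
class BtbWay:
    """Hypothetical BTB entry: one branch remembered in a fetch line."""
    tag: int
    offset: int            # position of the branch inside the 16-byte line
    predicted_taken: bool
    target: int


def split_ip(ip: int):
    """Split an IP into (tag, set index, offset). The split is an assumption."""
    offset = ip % LINE_BYTES
    set_index = (ip // LINE_BYTES) % NUM_SETS
    tag = ip // (LINE_BYTES * NUM_SETS)
    return tag, set_index, offset


def predict_line(ways: List[BtbWay], fetch_ip: int) -> Optional[BtbWay]:
    """Return the first predicted-taken branch at or after the fetch IP."""
    tag, _, fetch_offset = split_ip(fetch_ip)
    # Keep tag-matching ways that are not before the fetch IP, ordered by offset.
    candidates = sorted(
        (w for w in ways if w.tag == tag and w.offset >= fetch_offset),
        key=lambda w: w.offset,
    )
    for way in candidates:
        if way.predicted_taken:
            return way
    return None  # no taken branch predicted: fetch continues sequentially


ways_in_set = [
    BtbWay(tag=2, offset=2, predicted_taken=True, target=0x3000),
    BtbWay(tag=2, offset=6, predicted_taken=False, target=0x4000),
    BtbWay(tag=2, offset=9, predicted_taken=True, target=0x2000),
    BtbWay(tag=2, offset=12, predicted_taken=True, target=0x5000),
]
chosen = predict_line(ways_in_set, fetch_ip=0x1004)
```

With a fetch IP at offset 4, the branch at offset 2 is discarded, the one at offset 6 is predicted not taken, and the branch at offset 9 supplies the predicted target.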

The P6 BTB
- 2-level, local histories, per-set counters
- 4-way set associative: 512 entries in 128 sets
- Up to 4 branches can have a tag match
[Figure: BTB organization: 128 sets of 4 ways; each way holds a tag, target, history (e.g. 1001), offset, and P/T/V fields; branch type encoding 00 cond, 01 ret, 10 call, 11 uncond; per-set counters and LRR; prediction = msb of counter; Return Stack Buffer]
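A minimal sketch of a two-level predictor with local history may help here. It assumes a 4-bit history per branch (as in the figure's example `1001`) that indexes a table of 2-bit saturating counters, with the prediction taken from the counter's most significant bit. The class name, the dictionary-based storage, and the initial counter value are illustrative assumptions, and the speculative history update mentioned earlier is not modeled.

```python
class LocalTwoLevelPredictor:
    """Toy two-level predictor: per-branch history selects a 2-bit counter."""

    HIST_BITS = 4  # assumed history length, matching the figure's example

    def __init__(self):
        self.history = {}                             # branch IP -> history bits
        self.counters = [1] * (1 << self.HIST_BITS)   # start weakly not-taken

    def predict(self, ip: int) -> bool:
        h = self.history.get(ip, 0)
        # Prediction is the most significant bit of the 2-bit counter.
        return (self.counters[h] >> 1) == 1

    def update(self, ip: int, taken: bool) -> None:
        h = self.history.get(ip, 0)
        c = self.counters[h]
        # Saturating increment on taken, decrement on not taken.
        self.counters[h] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the outcome into the branch's local history.
        mask = (1 << self.HIST_BITS) - 1
        self.history[ip] = ((h << 1) | int(taken)) & mask


predictor = LocalTwoLevelPredictor()
for i in range(40):                  # train on an alternating T, N, T, N pattern
    predictor.update(0x40, i % 2 == 0)
next_prediction = predictor.predict(0x40)
```

A single 2-bit counter cannot learn an alternating branch, but with local history the pattern `1010` and the pattern `0101` select different counters, so both phases are predicted correctly after warm-up.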

In-Order Front End: Decoder
- Instruction Length Decode: takes 16 instruction bytes from the IFU and determines where each IA instruction starts
- Alignment: directs the bytes of each instruction to a decoder
- Decoders D0, D1, D2: convert instructions into uops
  - D0 produces up to 4 uops; D1 and D2 produce 1 uop each
  - If an instruction aligned with D1 or D2 decodes into >1 uops, it is deferred to the next cycle
- IDQ: buffers up to 6 uops, smoothing the decoder’s variable throughput

Micro Operations (Uops)
- Each “CISC” inst is broken into one or more “RISC” uops
  - Simplicity:
    - Each uop is (relatively) simple
    - Canonical representation of src/dest (2 src, 1 dest)
  - Increased ILP
    - e.g., pop eax becomes esp1 ← esp0+4, eax1 ← [esp0]
- Simple instructions translate to a few uops
  - Typical uop count (not necessarily the cycle count!)
      Reg-Reg ALU/Mov inst          1 uop
      Mem-Reg Mov (load)            1 uop
      Mem-Reg ALU (load + op)       2 uops
      Reg-Mem Mov (store)           2 uops (st addr, st data)
      Reg-Mem ALU (ld + op + st)    4 uops
- Complicated instructions need ucode

Out-of-order Core
- ROB: mechanism for renaming and retirement
  - 40 entries that hold instructions in order
- RS: pool of all “not yet executed” instructions (up to 20)
- Execution units
  - IEU: Integer Execution Unit
  - FEU: Floating-point Execution Unit
- Memory related units
  - AGU: Address Generation Unit
  - MIU: Memory Interface Unit
  - DCU: Data Cache Unit
  - MOB: orders memory operations
  - L2: Level 2 cache

Out-of-order Core: Execution Units
[Figure: execution units grouped by RS port (Port 0, Port 1, Port 2, Port 3,4): SHF, FMU, FDIV, IDIV, FAU, IEU, JEU, and the AGUs for load address and store address]

Out Of Order Execution
- Reservation station
  - 20 entries; schedule, dispatch, bypass
- Execution ports
  - 5 ports, each one serving one or more execution units
    - port 0: Integer, FP, SIMD, Shift, DIV, Pfmul, PFlogic
    - port 1: Integer, Jmp, SIMD, PFadd, PFlogic, Shuffle
    - port 2: Load address, Load data
    - port 3: Store address
    - port 4: Store data
  - All the units on a port share the same WB bus
- Reorder buffer
  - 40 entries, RRF, in order, 2 read ports
- MOB

Alloc & RAT
- Perform register allocation and renaming for ≤3 uops/cycle
- The Register Alias Table (RAT)
  - Maps the architectural registers onto the physical registers
    - For each arch reg, holds the number of the latest phy reg that updates it
    - When a new uop that writes to an arch reg R is allocated, the RAT records the phy reg allocated to the uop as the latest reg that updates R
- The Allocator (Alloc)
  - Assigns each uop an entry number in the ROB / RS / MOB
  - For each one of the sources (architectural registers) of the uop
    - Looks up the RAT to find the latest phy reg updating it
    - Writes it in the RS entry
  - Allocates Load & Store buffers in the MOB
- Example RAT contents:
    Arch reg   Entry   Location
    EAX        0       RRF
    EBX        19      ROB
    ECX        23      ROB

Re-order Buffer (ROB)
- Holds 40 uops which are not yet committed
  - In the same order as in the program
- Provides a large physical register space for register renaming
  - One physical register per ROB entry
    - physical register number = entry number
    - Each uop has only one destination
- Buffers the execution results until retirement
  - “Data valid” is set after the uop has executed and its result has been written to the physical reg
- Example ROB contents:
    #entry   Data valid   Physical reg data   Architectural dest. reg
    0        V            12H                 EBX
    1        V            33H                 ECX
    39       I            xxx                 XXX

RRF – Real Register File
- Holds the Architectural Register File
  - Architectural registers are numbered: 0 – EAX, 1 – EBX, …
- The value of an architectural register
  - is the value written to it by the last committed instruction that writes to this register
- Example RRF contents:
    #entry    Arch reg data
    0 (EAX)   9AH
    1 (EBX)   F34H

Uop flow through the ROB
- Uops are entered in order
  - Registers are renamed by the entry number
- Once assigned: execution order is unimportant
- After execution:
  - Entries are marked “executed” and wait for retirement
  - An executed entry can be “retired” once all prior instructions have retired
  - Commit architectural state only after speculation (branch, exception) has resolved
- Retirement
  - Detect exceptions and mispredictions
    - Initiate repair to get the machine back on the right track
  - Update “real registers” with the values of the renamed registers
  - Update memory
  - Leave the ROB

Reservation station (RS)
- Pool of all “not yet executed” uops (up to 20)
  - Holds the uop attributes and the uop source data until it is dispatched
- When a uop is allocated in the RS, its operand values are updated
  - If the operand is from an architectural register, the value is taken from the RRF
  - If the operand is from a phy reg with data valid set, the value is taken from the ROB
  - If the operand is from a phy reg with data valid not set, wait for the value
- The RS maintains the operand status “ready/not-ready”
  - Each cycle, executed uops make more operands “ready”
    - The RS arbitrates the WB buses between the units
    - The RS monitors the WB bus to capture data needed by waiting uops
    - Data can be bypassed directly from the WB bus to an execution unit
  - Uops with all operands ready can be dispatched for execution
    - The dispatcher chooses which of the ready uops to execute next
    - It dispatches the chosen uops to functional units
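The wake-up and dispatch behavior described on this slide can be sketched as follows. The `RsEntry` record, the slot names, and the oldest-first selection policy are assumptions for illustration; the slide does not specify how the dispatcher chooses among ready uops.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RsEntry:
    """Hypothetical RS entry: captured operand values plus pending tags."""
    op: str
    pdst: int                                               # destination ROB entry
    operands: Dict[str, int] = field(default_factory=dict)  # slot -> value
    waiting: Dict[str, int] = field(default_factory=dict)   # slot -> producer ROB entry

    def ready(self) -> bool:
        return not self.waiting


def writeback(rs: List[RsEntry], pdst: int, value: int) -> None:
    """Broadcast a result on the WB bus; waiting entries capture matching data."""
    for entry in rs:
        for slot, tag in list(entry.waiting.items()):
            if tag == pdst:
                entry.operands[slot] = value
                del entry.waiting[slot]


def dispatch(rs: List[RsEntry]) -> Optional[RsEntry]:
    """Remove and return the oldest ready uop (selection policy is assumed)."""
    for i, entry in enumerate(rs):
        if entry.ready():
            return rs.pop(i)
    return None


rs = [
    RsEntry("add", pdst=37, operands={"src2": 0x97}, waiting={"src1": 19}),
    RsEntry("sub", pdst=38, operands={"src1": 1, "src2": 2}),
]
first = dispatch(rs)          # the younger "sub" goes first: "add" still waits
writeback(rs, pdst=19, value=0x12)
second = dispatch(rs)         # "add" is now data-ready
```

The older `add` stays in the RS until ROB entry 19 is broadcast on the WB bus, so the younger `sub` is dispatched ahead of it, which is the out-of-order behavior the slide describes.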

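Register Renaming example
- IDQ delivers: Add EAX, EBX, EAX
- RAT before allocation (sources are renamed: Add EAX, ROB19, RRF0):
    EAX   0    RRF
    EBX   19   ROB
    ECX   23   ROB
- ALLOC assigns ROB entry 37 as the destination (Add ROB37, ROB19, RRF0); RAT after allocation:
    EAX   37   ROB
    EBX   19   ROB
    ECX   23   ROB
- ROB before allocation:
    19   V   12H   EBX
    23   V   33H   ECX
    37   I   xxx   XXX
- ROB after allocation:
    19   V   12H   EBX
    23   V   33H   ECX
    37   I   xxx   EAX
- RRF:
    0   EAX   97H
- RS entry written:
    op    src1   src2   Pdst
    add   97H    12H    37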
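The same example can be traced in a short Python sketch. The tuple encoding of uops and the `rename` helper are assumptions made for illustration; only the register numbers come from the slide.

```python
def rename(uop, rat, rob_entry):
    """Rename one uop given as (op, dst, src1, src2) with architectural names.

    Sources are looked up in the RAT *before* the destination mapping is
    updated, so a uop that reads and writes the same register (as here)
    still sees the older value.
    """
    op, dst, src1, src2 = uop
    renamed_sources = (rat[src1], rat[src2])
    rat[dst] = ("ROB", rob_entry)          # latest producer of dst
    return (op, ("ROB", rob_entry)) + renamed_sources


# RAT contents from the slide: EAX lives in the RRF, EBX and ECX in the ROB.
rat = {"EAX": ("RRF", 0), "EBX": ("ROB", 19), "ECX": ("ROB", 23)}
renamed = rename(("add", "EAX", "EBX", "EAX"), rat, rob_entry=37)
```

After the call, `renamed` corresponds to `Add ROB37, ROB19, RRF0`, and the RAT maps EAX to ROB entry 37.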

In-Order Retire
- ROB:
  - Retires up to 3 uops per clock
  - Copies the values to the RRF
  - Retirement is done in order
  - Performs exception checking

In-order Retirement
- The process of committing the results to the architectural state of the processor
- Retires up to 3 uops per clock
- Copies the values to the RRF
- Retirement is done in order
- Performs exception checking
- An instruction is retired after the following checks
  - The instruction has executed
  - All previous instructions have retired
  - The instruction isn’t mispredicted
  - No exceptions
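A simplified sketch of the retirement checks follows. The dictionary-based ROB entries and the single `fault` flag (standing in for both exceptions and mispredictions) are assumptions for illustration; real recovery is more involved than clearing a list.

```python
def retire(rob, rrf, width=3):
    """Retire up to `width` uops from the head of the ROB, in order.

    rob: list of dicts, oldest first, with keys
         'dest', 'value', 'executed', 'fault' (hypothetical layout).
    rrf: dict mapping architectural register name -> committed value.
    Returns the number of uops retired this cycle.
    """
    retired = 0
    while rob and retired < width:
        head = rob[0]
        if not head["executed"]:
            break                      # younger uops must wait behind the head
        if head["fault"]:
            rob.clear()                # simplified flush of all in-flight uops
            break
        rrf[head["dest"]] = head["value"]   # commit to architectural state
        rob.pop(0)
        retired += 1
    return retired


rob = [
    {"dest": "EAX", "value": 1, "executed": True, "fault": False},
    {"dest": "EBX", "value": 2, "executed": True, "fault": False},
    {"dest": "ECX", "value": 3, "executed": False, "fault": False},
    {"dest": "EAX", "value": 4, "executed": True, "fault": False},
]
rrf = {}
retired_first_cycle = retire(rob, rrf)
```

Only the first two uops retire in this cycle: the fourth uop has already executed, but it cannot retire until the unexecuted third uop ahead of it does.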

Flow of Uops
- DECODE:
  - Decoders translate instructions into uops (1 to 6 uops per cycle)
- ISSUE:
  - The ALLOC unit allocates one entry per uop in the RS and in the ROB (for up to 3 uops per cycle)
    - If source data is available from the ROB (either from the RRF or from the Result Buffer (RB)), it is written into the RS entry
    - Otherwise, it is marked invalid in the RS (and should be captured from the WB bus)
- READY/SCHEDULE:
  - Check for data-ready uops whose desired functional unit is available
  - Up to 5 resource-ready uops are selected and dispatched per clock
- DISPATCH:
  - Ship scheduled uops to the appropriate functional unit (RS)

Flow of Uops (cont)
- WRITEBACK:
  - Capture results returned by the functional units in a result buffer (ROB)
  - Snoop the result writeback ports for results that are sources to uops in the RS
  - Update the data-ready status of these uops (RS)
- RETIREMENT:
  - 3 consecutive entries are read out of the RB
    - these entries are candidates for retirement
  - Algorithm to determine fitness for retirement: a candidate is retired if
    - its ready bit is set
    - it will not cause an exception
    - all preceding candidates are eligible for retirement
  - Commit results from the result buffer to architecturally visible state in the original “issue” order
  - Clear the machine and restart execution if “badness” occurs (ROB)

Large ROB and RS are Important
- Large RS
  - Increases the window in which to look for independent instructions
    - Exposes more parallelism potential
- Large ROB
  - The ROB is a superset of the RS: ROB size ≥ RS size
  - Allows covering long-latency operations (cache miss, divide)
- Example
  - Assume there is a Load that misses the L1 cache
    - Data takes ~10 cycles to return, so ~30 new instructions get into the pipeline
  - Instructions following the Load cannot commit
    - They pile up in the ROB
  - Instructions independent of the load are executed, and leave the RS
    - As long as the ROB is not full, we can keep executing instructions
  - A 40-entry ROB can cover for an L1 cache miss
    - It cannot cover for an L2 cache miss, which takes hundreds of cycles
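The arithmetic behind this example can be written out explicitly. The sketch below assumes the ROB is empty when the miss starts and that the allocator sustains its peak rate of 3 uops per cycle, which is an optimistic simplification; the 200-cycle L2 miss latency is a placeholder for "hundreds of cycles".

```python
ROB_ENTRIES = 40   # from the slides
ALLOC_WIDTH = 3    # uops allocated per cycle, from the slides


def uops_allocated_during(latency_cycles: int, width: int = ALLOC_WIDTH) -> int:
    """Uops that enter the ROB while a load is outstanding (peak-rate assumption)."""
    return latency_cycles * width


def rob_covers(latency_cycles: int, entries: int = ROB_ENTRIES) -> bool:
    """True if the ROB can absorb all uops allocated during the latency."""
    return uops_allocated_during(latency_cycles) <= entries


l1_miss_uops = uops_allocated_during(10)      # ~10-cycle L1 miss
l1_covered = rob_covers(10)
l2_covered = rob_covers(200)                  # assumed latency for an L2 miss
max_covered_cycles = ROB_ENTRIES // ALLOC_WIDTH
```

Under these assumptions a 10-cycle miss brings in 30 uops, which fits in 40 entries, while the ROB fills after roughly 13 cycles of a longer miss and the front end then stalls.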

P6 Caches
- Blocking caches
  - A cache miss prevents other cache requests (which could possibly be hits) from being served
  - Makes OOO execution much less beneficial (cannot hide misses)
- Both the L1 and L2 caches in the P6 are non-blocking
  - They initiate the actions necessary to return data for a cache miss while they respond to subsequent cached data requests
  - Support up to 4 outstanding misses
    - Misses translate into outstanding requests on the P6 bus
    - The bus can support up to 8 outstanding requests
- The cache “squashes” (suspends) subsequent requests for the same missed cache line
  - Squashed requests are not counted in the number of outstanding requests
  - Once the engine has executed beyond the 4 outstanding requests, subsequent load requests are placed in the (12 deep) load buffer

Memory Operations
- Two types of memory access: loads and stores
- Loads
  - Loads need to specify
    - the memory address to be accessed
    - the width of the data being retrieved
    - the destination register
  - Loads are encoded into a single uop
- Stores
  - Stores need to provide
    - a memory address, a data width, and the data to be written
  - Stores require two uops:
    - One to generate the address
    - One to generate the data
  - These uops are scheduled independently to maximize concurrency
    - They must re-combine in the store buffer for the store to complete

The Memory Execution Unit
- As a black box, the MEU is just another execution unit
  - Receives memory reads and writes from the RS
  - Returns data back to the RS and to the ROB
- Unlike many execution units, the MEU has side effects
  - May cause a bus request to external memory
- The RS dispatches memory uops
  - When all the data used for address calculation is ready
  - When both the MOB and the Address Generation Unit (AGU) are free
  - The AGU computes the linear address: Segment-Base + Base-Address + (Scale*Index) + Displacement
    - It sends the linear address to the MOB, where it is stored with the uop in the appropriate Load Buffer or Store Buffer entry
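The AGU formula on this slide can be expressed directly. This is a sketch: the function name and the 32-bit wrap-around mask are assumptions for a 32-bit linear address space, and segment limit and protection checks are ignored.

```python
def linear_address(segment_base: int, base: int, index: int,
                   scale: int, displacement: int) -> int:
    """Segment-Base + Base-Address + (Scale * Index) + Displacement."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 scale factor must be 1, 2, 4, or 8")
    # Wrap to 32 bits (assumed address width for this sketch).
    return (segment_base + base + scale * index + displacement) & 0xFFFFFFFF


# An access of the form -4(%ebx,%ecx,4) with ebx = 0x1000, ecx = 3, flat segment.
addr = linear_address(segment_base=0, base=0x1000, index=3, scale=4,
                      displacement=-4)
```

Here `addr` is `0x1008`: the base `0x1000` plus `4 * 3` minus the displacement of 4.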

The Memory Execution Unit (cont.)
- The RS operates based on dataflow dependencies
  - But it cannot detect memory dependencies, even ones as simple as the following:
      i1: movl %ebx, -4(%ebp)   # MEM[ebp-4] ← ebx
      i2: movl -4(%ebp), %eax   # eax ← MEM[ebp-4]
  - The RS may dispatch these operations in any order
  - If the MEU blindly executed the above operations, eax might get the wrong value
  - It is the job of the MEU to detect violations of memory ordering, and to prevent stale data from returning to the core
- The MEU could require memory operations to dispatch non-speculatively
  - This would be slow and would severely penalize all other operations in the machine waiting for memory read return data
- Instead, the MEU allows speculative execution as much as possible

OOO Execution of Memory Operations
- Stores are not executed OOO
  - Stores are never performed speculatively
    - there is no transparent way to undo them
  - Stores are also never re-ordered among themselves
    - The Store Buffer dispatches a store only when the store has both its address and its data, and there are no older stores awaiting dispatch
- Resolving memory dependencies: memory disambiguation
  - Some memory dependencies can be resolved statically
      store r1,a
      load  r2,b       (can advance the load before the store)
  - Problem: some cannot
      store r1,[r3]
      load  r2,b       (the load must wait till r3 is known)

Performance Impact of Forcing In-order Memory Operations
- x86 has a small register set, so it uses memory often
- The importance of memory access reordering was studied. Basic conclusions:
  - Preventing Stores from passing Stores/Loads
    - 3%~5% performance loss
    - P6 chooses not to allow Stores to pass Stores/Loads
  - Preventing Loads from passing Loads/Stores
    - Big performance loss
    - Need to allow loads to pass stores, and loads to pass loads

Memory Order Buffer (MOB)
- Every memory uop is allocated an entry, in order
  - De-allocated when it retires
- Address and data (for stores) are updated when known
- A load is checked against all previous stores
  - Waits if a store to the same address exists but its data is not ready
    - If the store data exists, just use it
  - Waits till all previous store addresses are resolved
    - In case of no address collision, go to memory
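The load check described on this slide can be sketched as a small decision function. The dictionary layout of the store entries and the three result labels are assumptions for illustration, and the sketch treats every access as the same size with exact address matches, ignoring partial overlaps.

```python
def check_load(load_addr, older_stores):
    """Decide what a load may do, given the stores that precede it.

    older_stores: list of dicts, oldest first, with keys
        'addr' (None while the address is unknown) and
        'data' (None while the data is not ready).
    Returns ("wait", None), ("forward", data), or ("memory", None).
    """
    # The load waits until all previous store addresses are resolved.
    if any(store["addr"] is None for store in older_stores):
        return ("wait", None)
    # The youngest older store to the same address supplies the value.
    for store in reversed(older_stores):
        if store["addr"] == load_addr:
            if store["data"] is None:
                return ("wait", None)        # address matches, data not ready
            return ("forward", store["data"])
    return ("memory", None)                  # no collision: go to memory


stores = [
    {"addr": 0x100, "data": 7},
    {"addr": 0x200, "data": None},
    {"addr": 0x100, "data": 9},
]
hit = check_load(0x100, stores)
pending = check_load(0x200, stores)
miss = check_load(0x300, stores)
```

A load from `0x100` takes the value 9 from the youngest matching store, a load from `0x200` waits for the store data, and a load from `0x300` has no collision and goes to memory.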

Load Execution
- A load is dispatched into the memory pipe after dispatch from the RS
- If it misses the DCU, it is dispatched by the BUS unit
- After the data returns from the L2 or the bus
  - the load signals as complete (Load WB Valid)
  - it is eligible to retire
[Figure: load flow between RS, MOB, DCU, L2, Bus, and ROB: RS dispatch on port 2, memory load and fill buffer (FB) request, data request, bus request, data return, Load WB Valid]

Pentium® III: Senior Load Implementation
- If a load misses the cache, its retirement is delayed
  - Instructions following the load are executed, but cannot retire until the load retires
    - Even if the instructions are independent of the load data
  - These instructions accumulate inside the machine
  - Eventually the ROB is saturated, and the front end is stalled
- Pentium® III removed this bottleneck for Prefetch instructions
  - Instructions following a Prefetch are allowed to retire before the Prefetch returns data
- A Prefetch signals readiness for retirement much earlier than a Load
  - Completion is signaled almost immediately after allocation into the MOB (Pref WB Valid)
  - Completion is not delayed until the data is actually fetched from the memory subsystem

Senior load implementation
[Figure: prefetch flow between RS, MOB, DCU, L2, Bus, and ROB: RS dispatch on port 2, memory load and FB request, data request, bus request, data return, Pref WB Valid]

Jump Misprediction – Flush at Retire
- Flush the pipe when the mispredicted jump retires
  - Retirement is in order
    - all the instructions preceding the jump have already retired
    - all the instructions remaining in the pipe follow the jump
    - flush all the instructions in the pipe
- Disadvantage
  - The misprediction is known after the jump has executed, but we continue fetching instructions from the wrong path until the branch retires

Jump Misprediction – Flush at Execute
- When the JEU detects a jump misprediction, it
  - Flushes the in-order front end
  - Starts fetching and decoding from the “correct” path
    - The “correct” path may still be wrong:
      - A preceding uop that hasn’t executed may cause an exception
      - A preceding jump executed OOO can also mispredict
  - The “correct” instruction stream is stalled at the RAT
    - The RAT was also wrongly updated by wrong-path instructions
- When the mispredicted branch retires
  - Reset all state in the out-of-order engine (RAT, RS, RB, MOB, etc.)
    - Only instructions following the jump are left, so they must all be flushed
    - Reset the RAT to point only to architectural registers
  - Un-stall the in-order machine
  - The RS gets uops from the RAT and starts scheduling and dispatching them

Interrupt Handling
- Complications for pipelines:
  - Interrupts occur in the middle of an instruction
  - Must restart the interrupting and subsequent instructions
- Precise interrupts preserve the model that instructions execute in program order
  - Identify the instruction that caused the interrupt
  - Instructions before the faulting instruction finish
  - Disable writes for the faulting and subsequent instructions
  - Force a trap instruction into the pipeline
  - Trap routine
    - Save the state of the executing program
    - Correct the cause of the interrupt
    - Restore the program state
  - Restart the faulting and subsequent instructions

P6 Tutorials
- Pentium® Pro Processor Microarchitecture Overview
- Optimizing for Intel® Pentium® II and Pentium Pro Processors

Backup

P6 Queuing
[Figure: pipeline diagram (stages I1-I8, O1, O3, R1, R2) marking the queuing points: decoder queue (IDQ), RS scheduling, retirement scheduling]
- Decoder queue (IDQ)
  - Stores uops until they can enter the RAT/Allocator
  - Smooths the variable number of uops output by the decoder per cycle

Register Renaming Helps with Small Number of Architectural Registers
- x86 has a small number of general purpose registers
  - False dependencies can be caused by the need to reuse registers for unrelated purposes, e.g.
      MOV EAX, 17
      ADD Mem, EAX
      MOV EAX, 3
      ADD EAX, EBX
  - Solved by register renaming

Streaming Stores
- Applications can stream data from memory and use it just once
  - Using regular cache models will result in eviction of useful data from the cache
  - Many applications have a large working set (e.g., video and 3D), which doesn’t fit into the cache
  - In such a situation, it is better to bypass the cache
  - Hence the data should be written directly to memory: this is what streaming stores are for
- Pentium III improved the write bandwidth by 20%
  - Pentium III can now saturate a 100 MHz bus with 800 MB/s of writes
  - Done by removing a dead cycle between back-to-back write-combining (WC) writes
- Before the Pentium III, the architecture was mainly oriented to scalar applications
  - Fill buffers provide high instantaneous throughput for the bursts of misses in scalar apps
  - Comparatively small average bandwidth requirements (~100 MB per second)
- SSE applications require high average bandwidth
  - Now the fill buffers need to sustain high average throughput
- Fill buffers were modified to allow all four buffers to be utilized for WC in parallel
  - Pentium II allows just one
- The following WC eviction policies were added in Pentium III:
  - A buffer is evicted when all its bytes have been written (all dirty)
  - Previously the buffer eviction policy was “resource demand” driven
    - i.e., a buffer gets evicted when the DCU requests the allocation of a new buffer
  - When all fill buffers are busy, a DCU fill buffer allocation request (such as a regular load, store, or prefetch requiring a fill buffer) can evict a WC buffer even if it is not full yet
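The 800 MB/s figure follows from the bus clock if one assumes a 64-bit (8-byte) data bus transferring once per clock. The bus width is not stated on the slide, so treat it as an assumption of this sketch.

```python
BUS_CLOCK_HZ = 100_000_000   # 100 MHz bus, from the slide
BUS_WIDTH_BYTES = 8          # assumed 64-bit data bus (not stated on the slide)


def peak_write_bandwidth_mb_per_s(clock_hz: int = BUS_CLOCK_HZ,
                                  width_bytes: int = BUS_WIDTH_BYTES) -> float:
    """Peak bandwidth in MB/s, assuming one full-width transfer per bus clock."""
    return clock_hz * width_bytes / 1_000_000


peak = peak_write_bandwidth_mb_per_s()
```

Under these assumptions `peak` is 800.0, matching the slide, which is why any dead cycle between back-to-back writes directly reduces achievable write bandwidth.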