1 Computer Architecture
The P6 Micro-Architecture: An Example of an Out-Of-Order Microprocessor
Lihu Rappoport and Adi Yoaz

2 The P6 Family
Features: Out-Of-Order execution, register renaming, speculative execution,
multiple branch prediction, super pipeline (12 pipe stages)

  Processor     Year  Freq (MHz)  Bus (MHz)  L2 cache   Process
  Pentium® Pro  1995  150~200     60/66      256/512K*  0.5, 0.35μ
  Pentium® II   1997  233~450     66/100     512K*      0.35, 0.25μ
  Pentium® III  1999  450~1400    100/133    256/512K   0.25, 0.18, 0.13μ
  Pentium® M    2003  900~2260    400/533    1M / 2M    0.13μ, 90nm
  Core™         2005  1660~2330   667        2M         65nm
  Core™ 2       2006  1800~2930   800/1066   2/4/8M     —
  * off die

3 P6 Arch In-Order Front End External Bus Out-of-order Core
BIU: Bus Interface Unit IFU: Instruction Fetch Unit (includes IC) BPU: Branch Prediction Unit ID: Instruction Decoder MS: Micro-Instruction Sequencer RAT: Register Alias Table Out-of-order Core ROB: Reorder Buffer RRF: Real Register File RS: Reservation Stations IEU: Integer Execution Unit FEU: Floating-point Execution Unit AGU: Address Generation Unit MIU: Memory Interface Unit DCU: Data Cache Unit MOB: Memory Order Buffer L2: Level 2 cache In-Order Retire MS AGU MOB External Bus IEU MIU FEU BPU BIU IFU I D RAT R S L2 DCU ROB

4 P6 Pipeline
In-Order Front End:
  1: Next IP
  2: ICache lookup
  3: ILD (instruction length decode)
  4: Rotate
  5: ID1
  6: ID2
  7: RAT – rename sources; ALLOC – assign destinations
  8: ROB – read sources; RS – schedule data-ready uops for dispatch
Out-of-Order Core:
  9: RS – dispatch uops
  10: EX
In-Order Retirement:
  11: Retirement

5 In-Order Front End
Flow: bytes → instructions → uops
BPU – Branch Prediction Unit – predicts the next fetch address
IFU – Instruction Fetch Unit:
  the iTLB translates virtual to physical addresses (accesses the PMH on a miss)
  the IC supplies 16 bytes/cycle (accesses the L2 cache on a miss)
ILD – Instruction Length Decode – splits bytes into instructions
IQ – Instruction Queue – buffers the instructions
ID – Instruction Decode – decodes instructions into uops
MS – Micro-Sequencer – provides uops for complex instructions

6 Branch Prediction
Implementation:
  Use local history to predict direction
  Need to predict multiple branches per cycle
  Need to predict branches before previous branches are resolved:
    branch history is updated first based on the prediction, and later
    corrected based on actual execution (speculative history)
  Target address taken from the BTB
  RSB (Return Stack Buffer) used for call/return pairs
Prediction rate: ~92%, i.e. ~60 instructions between mispredictions
A high prediction rate is crucial for long pipelines, and especially for
OOO, speculative execution:
  on a misprediction, all instructions following the branch in the
  instruction window are flushed
  the effective size of the window is determined by prediction accuracy
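The quoted figures can be sanity-checked with a small calculation. The
sketch below is illustrative, not from the deck: the branch fraction of
~1 branch per 5 instructions is an assumed typical value.

```python
# Hedged sketch: relate prediction rate to misprediction spacing.
# Assumption (not from the slides): ~1 branch per 5 instructions.
def instructions_between_mispredictions(pred_rate, branch_fraction=0.2):
    """Expected number of instructions between two mispredictions."""
    mispredicts_per_instruction = (1.0 - pred_rate) * branch_fraction
    return 1.0 / mispredicts_per_instruction

# A ~92% prediction rate gives 62.5 instructions, close to the quoted ~60.
spacing = instructions_between_mispredictions(0.92)
```

Read the other way, this is why long pipelines need high accuracy: at 80%
the spacing drops to 25 instructions, less than the 40-entry ROB.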

7 Branch Prediction – Clustering
Need to provide predictions for the entire fetch line each cycle
Predict the first taken branch in the line following the fetch IP
Implemented by splitting the IP into offset within the line, set, and tag:
  If the tags of more than one way match the fetch IP, the offsets of the
  matching ways are ordered
  Ways with an offset smaller than the fetch IP offset are discarded
  The first branch that is predicted taken is chosen as the predicted
  branch
A predicted-taken branch may jump into a fetch line or out of it; all
other branches in the line are predicted not taken
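The selection logic above can be sketched in a few lines. This is a
minimal model with assumed data shapes, not the actual hardware ordering
network:

```python
# Sketch: choose the predicted branch within a fetch line. Order the
# tag-matching ways by offset, discard ways before the fetch offset, and
# pick the first predicted-taken branch.
def predict_in_line(fetch_ofst, matching_ways):
    """matching_ways: (offset, predicted_taken) pairs, in any order.
    Returns the offset of the predicted branch, or None if the line is
    predicted to fall through."""
    for ofst, taken in sorted(matching_ways):
        if ofst < fetch_ofst:
            continue                  # branch lies before the fetch IP
        if taken:
            return ofst               # first taken branch wins
    return None

# Fetch enters the line at offset 4: the taken branch at offset 2 is
# behind the fetch IP and must be ignored; the not-taken branch at 6 is
# skipped; the taken branch at 9 is chosen.
assert predict_in_line(4, [(9, True), (2, True), (6, False)]) == 9
```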

8 The P6 BTB
2-level, local histories, per-set counters
4-way set associative: 512 entries in 128 sets
The fetch IP is split into offset, 7-bit set index, and 9-bit tag
Each entry holds a tag, a target, a 4-bit local history, and a branch
type (00 – cond, 01 – ret, 10 – call, 11 – uncond)
Per-set 2-bit counters; the prediction is the MSB of the counter
Returns are predicted with the Return Stack Buffer
Up to 4 branches (ways) can have a tag match
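The 2-level idea can be shown with a small model. This is a simplified
sketch: it keeps the slide's 4-bit local history and 2-bit counters, but
attaches the counters directly to one branch instead of modeling the
per-set sharing.

```python
class LocalPredictor:
    """Sketch of a 2-level local-history predictor: a 4-bit per-branch
    history selects one of 16 two-bit saturating counters."""
    def __init__(self, hist_bits=4):
        self.hist = 0
        self.mask = (1 << hist_bits) - 1
        self.counters = [2] * (1 << hist_bits)   # init weakly taken

    def predict(self):
        return self.counters[self.hist] >= 2     # MSB of the 2-bit counter

    def update(self, taken):
        c = self.counters[self.hist]
        self.counters[self.hist] = min(3, c + 1) if taken else max(0, c - 1)
        self.hist = ((self.hist << 1) | int(taken)) & self.mask

p = LocalPredictor()
# Train on an alternating taken / not-taken pattern; after warm-up the
# local history makes the pattern fully predictable.
for t in [True, False] * 20:
    p.update(t)
assert p.predict()   # history now says "last was not-taken", so: taken
```

A single 2-bit counter would mispredict an alternating branch about half
the time; the history-indexed counters capture the pattern exactly.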

9 In-Order Front End: Decoder
16 instruction bytes arrive from the IFU
Instruction Length Decode determines where each IA instruction starts
The IQ buffers the instructions
Four decoders convert instructions into uops:
  D0 handles instructions of up to 4 uops; D1, D2 and D3 handle 1-uop
  instructions
  If an instruction aligned with D1/D2/D3 decodes into more than 1 uop,
  it is deferred to the next cycle
The IDQ buffers uops and smooths the decoder's variable throughput

10 Micro-Operations (Uops)
Each CISC instruction is broken into one or more RISC uops
Simplicity: each uop is (relatively) simple
  Canonical representation of src/dest (2 sources, 1 destination)
Increased ILP: e.g., pop eax becomes esp1 ← esp0+4, eax1 ← [esp0], which
  can execute independently
Simple instructions translate to a few uops
Typical uop count (not necessarily cycle count!):
  Reg-Reg ALU/Mov inst:        1 uop
  Mem-Reg Mov (load):          1 uop
  Mem-Reg ALU (load + op):     2 uops
  Reg-Mem Mov (store):         2 uops (st addr, st data)
  Reg-Mem ALU (ld + op + st):  4 uops
Complex instructions need ucode (the MS)
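The uop-count table can be restated as data, which makes the additive
structure visible: a read-modify-write is exactly a load+op followed by a
store. The class names are informal labels, not Intel terminology.

```python
# Typical uop counts from the slide (counts, not cycles).
UOPS = {
    "reg-reg alu/mov": 1,
    "load":            1,   # mem -> reg mov
    "load+op":         2,   # mem -> reg alu
    "store":           2,   # store-address uop + store-data uop
    "load+op+store":   4,   # read-modify-write to memory
}

# The RMW count decomposes into its parts:
assert UOPS["load+op+store"] == UOPS["load+op"] + UOPS["store"]
```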

11 Out-of-Order Core
Reorder Buffer (ROB): holds "not yet retired" instructions
  40 entries, in order
Reservation Stations (RS): hold "not yet executed" instructions
  20 entries
Execution units:
  IEU: Integer Execution Unit
  FEU: Floating-point Execution Unit
Memory-related units:
  AGU: Address Generation Unit
  MIU: Memory Interface Unit
  DCU: Data Cache Unit
  MOB: orders memory operations
  L2: Level 2 cache

12 Alloc & RAT
Perform register allocation and renaming for ≤4 uops/cycle
The Register Alias Table (RAT):
  Maps architectural registers to physical registers
  For each arch register, holds the number of the latest phys register
  that updates it
  When a new uop that writes to an arch register R is allocated, the
  phys register allocated to the uop is recorded as the latest register
  updating R
The Allocator (Alloc):
  Assigns each uop an entry number in the ROB / RS
  For each source (architectural register) of the uop:
    looks up the RAT to find the latest phys register updating it
    writes it into the RS entry
  Allocates load and store buffer entries in the MOB
Example RAT state:
  Arch reg  #reg  Location
  EAX       –     RRF
  EBX       19    ROB
  ECX       23    ROB
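The RAT lookup/update can be sketched in a few lines. This is a minimal
model assuming a (op, dest, src1, src2) uop tuple; it reproduces the
renaming example from slide 17 (add EAX, EBX, EAX with EBX already mapped
to ROB19 and EAX still in the RRF).

```python
# Minimal renaming sketch. A source maps through the RAT if a pending
# write exists, else it reads the architectural RRF.
def rename(uops, rat, next_rob):
    """Rename uops in program order: read sources through the RAT first,
    then point the RAT at the ROB entry allocated for the destination."""
    renamed = []
    for op, dst, s1, s2 in uops:
        srcs = tuple(rat.get(s, ("RRF", s)) for s in (s1, s2) if s)
        entry = next_rob
        next_rob += 1
        rat[dst] = ("ROB", entry)        # dst now lives in the ROB
        renamed.append((op, ("ROB", entry), srcs))
    return renamed, next_rob

rat = {"EBX": ("ROB", 19), "ECX": ("ROB", 23)}   # EAX still in the RRF
out, _ = rename([("add", "EAX", "EAX", "EBX")], rat, 37)
assert out == [("add", ("ROB", 37), (("RRF", "EAX"), ("ROB", 19)))]
assert rat["EAX"] == ("ROB", 37)
```

Note the ordering: sources are read through the RAT before the
destination mapping is overwritten, which is what makes "add EAX, EBX,
EAX" read the old EAX.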

13 Reorder Buffer (ROB)
Holds 40 uops that are not yet committed, in the same order as in the
program
Provides a large physical register space for register renaming:
  one physical register per ROB entry
  physical register number = entry number
  each uop has only one destination
Buffers the execution results until retirement
Valid data is set after the uop executes and its result is written to
the physical register
Example ROB state:
  #entry  Valid  Data  Arch dest reg
  1       V      12H   EBX
  2       V      33H   ECX
  ...
  39      I      xxx   ESI

14 RRF – Real Register File
Holds the architectural register file
Architectural registers are numbered: 0 – EAX, 1 – EBX, ...
The value of an architectural register is the value written to it by the
last committed instruction that writes to this register
Example RRF state:
  #entry    Data
  0 (EAX)   9AH
  1 (EBX)   F34H

15 Uop Flow Through the ROB
Uops are entered into the ROB in order
Registers are renamed by the entry number
Once assigned, execution order is unimportant
After execution, entries are marked "executed" and wait for retirement
An executed entry can be retired once all prior instructions have
retired:
  commit architectural state only after speculation (branch, exception)
  has been resolved
Retirement:
  Detects exceptions and mispredictions, and initiates repair to get the
  machine back on the right track
  Updates the "real registers" with the values of the renamed registers
  Updates memory
  Leaves the ROB
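The in-order commit rule above can be sketched as one retirement cycle.
The dict-based ROB entries are an assumed representation; the ≤3
uops/clock limit is from the deck.

```python
# In-order retirement sketch: pop executed entries from the head of the
# ROB (up to 3 per clock on P6) and commit their values to the RRF.
def retire(rob, rrf, max_retire=3):
    retired = 0
    while rob and retired < max_retire and rob[0]["executed"]:
        e = rob.pop(0)                 # head of the ROB = oldest uop
        rrf[e["dest"]] = e["value"]    # commit to architectural state
        retired += 1
    return retired

rob = [{"dest": "EBX", "value": 0x12, "executed": True},
       {"dest": "ECX", "value": 0x33, "executed": True},
       {"dest": "EAX", "value": 0x97, "executed": False}]
rrf = {}
assert retire(rob, rrf) == 2           # EAX blocks: not yet executed
assert rrf == {"EBX": 0x12, "ECX": 0x33}
```

Even though younger executed uops may sit behind EAX, nothing past the
first unexecuted entry commits; that is exactly what makes speculation
recoverable.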

16 Reservation Stations (RS)
Pool of all "not yet executed" uops; holds each uop's attributes and
source data until it is dispatched
When a uop is allocated in the RS, its operand values are updated:
  If an operand comes from an architectural register, its value is taken
  from the RRF
  If an operand comes from a phys register with data-valid set, its
  value is taken from the ROB
  If an operand comes from a phys register with data-valid not set, the
  uop waits for the value
The RS maintains operand status, "ready / not-ready":
  each cycle, executed uops make more operands "ready"
  the RS monitors the WB buses to capture data needed by waiting uops
  data can be bypassed directly from a WB bus to an execution unit
The RS arbitrates the WB buses between the units
Uops whose operands are all ready can be dispatched for execution:
  the dispatcher chooses which of the ready uops to execute next and
  dispatches them to the functional units
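The wakeup/select loop above can be sketched as one RS cycle. The uop
dicts and tag sets are assumed representations; a real RS also
arbitrates ports and functional units, which is omitted here.

```python
# One RS cycle: snoop the write-back bus to mark operands ready, then
# dispatch every uop whose operands are all ready.
def rs_cycle(rs, writeback_tags):
    for uop in rs:
        uop["waiting"] -= set(writeback_tags)   # capture WB results
    ready = [u for u in rs if not u["waiting"]]
    rs[:] = [u for u in rs if u["waiting"]]     # reclaim dispatched entries
    return [u["op"] for u in ready]

rs = [{"op": "add", "waiting": {37}},           # waits on ROB37
      {"op": "sub", "waiting": {37, 38}}]       # waits on ROB37 and ROB38
assert rs_cycle(rs, [37]) == ["add"]            # add's last operand arrived
assert rs_cycle(rs, [38]) == ["sub"]            # sub wakes up a cycle later
```

This also shows why execution is out of order: dispatch order depends
only on operand arrival, not on program order.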

17 Register Renaming Example
Uop from the IDQ: add EAX, EBX, EAX
RAT / Alloc before:  EAX → RRF,   EBX → ROB19, ECX → ROB23
RAT / Alloc after:   EAX → ROB37, EBX → ROB19, ECX → ROB23
Renamed uop: add ROB37, ROB19, RRF(EAX)
RS entry: add, src1 = 97H (EAX from the RRF), src2 = 12H (ROB19),
  Pdst = 37
ROB: entry 19 valid (12H, EBX), entry 23 valid (33H, ECX),
  entry 37 invalid (destination EAX), entry 38 free
RRF: EAX = 97H

18 Register Renaming Example (2)
Uop from the IDQ: sub EAX, ECX, EAX
RAT / Alloc before:  EAX → ROB37, EBX → ROB19, ECX → ROB23
RAT / Alloc after:   EAX → ROB38, EBX → ROB19, ECX → ROB23
Renamed uop: sub ROB38, ROB23, ROB37
RS entry: sub, src1 = ROB37 (not yet ready), src2 = 33H (ROB23),
  Pdst = 38
ROB: entry 37 invalid (destination EAX), entry 38 invalid
  (destination EAX)
RRF: EAX = 97H

19 Out-of-Order Core: Execution Units
Port 0: IEU, FAU, SHF, FMU, FDIV, IDIV
Port 1: IEU, JEU
Port 2: AGU – load address
Ports 3,4: AGU – store address, SDB – store data
Bypasses:
  1st bypass in the MIU
  2nd bypass in the RS
  internal 0-delay bypass within each EU

20 In-Order Retire
ROB:
  Retires up to 3 uops per clock
  Copies the values to the RRF
  Retirement is done in order
  Performs exception checking

21 In-Order Retirement
The process of committing the results to the architectural state of the
processor:
  Retires up to 3 uops per clock
  Copies the values to the RRF
  Retirement is done in order
  Performs exception checking
An instruction is retired only after the following checks pass:
  the instruction has executed
  all previous instructions have retired
  the instruction is not mispredicted
  there are no exceptions

22 Pipeline: Fetch
Pipeline stages: Predict/Fetch → Decode → Alloc → Schedule → EX → Retire
(buffers: IQ, IDQ, RS, ROB)
Fetch 16B from the I$
Length-decode the instructions within the 16B
Write the instructions into the IQ
[Speaker note: fetch 16B per cycle from the I$, figure out where the
instructions lie in those 16B, then steer them and write them into the
IQ – at most 6 instructions per cycle.]

23 Pipeline: Decode
Read 4 instructions from the IQ
Translate the instructions into uops
  Asymmetric decoders (4-1-1-1, see slide 9)
Write the resulting uops into the IDQ

24 Pipeline: Allocate
Allocate, port-bind and rename 4 uops:
  Allocate a ROB/RS entry per uop
  If the source data is available from the ROB or RRF, write the data to
  the RS
  Otherwise, mark the data as not ready in the RS

25 Pipeline: EXE
Ready/Schedule:
  Check for data-ready uops whose needed functional unit is available
  Select and dispatch ≤6 ready uops/clock to EXE
  Reclaim RS entries
Write back results into the RS/ROB:
  Write the results into the result buffer
  Snoop the write-back ports for results that are sources to uops in the
  RS
  Update the data-ready status of these uops in the RS

26 Pipeline: Retire
Retire the ≤4 oldest uops in the ROB
A uop may retire if:
  its ready bit is set
  it does not cause an exception
  all preceding candidates are eligible for retirement
Commit the results from the result buffer to the RRF
Reclaim the ROB entry
In case of an exception: nuke and restart

27 Jump Misprediction – Flush at Retire
Flush the pipe when the mispredicted jump retires:
  retirement is in order, so all the instructions preceding the jump
  have already retired, and all the instructions remaining in the pipe
  follow the jump
  flush all the instructions in the pipe
Disadvantage:
  the misprediction is known as soon as the jump executes, yet the
  machine continues fetching instructions from the wrong path until the
  branch retires

28 Jump Misprediction – Flush at Execute
When the JEU detects a jump misprediction it:
  Flushes the in-order front end
    Instructions already in the OOO part continue to execute, including
    instructions following the wrong jump, which take execution
    resources and waste power but will never be committed
  Starts fetching and decoding from the "correct" path
    The "correct" path may still be wrong:
      a preceding uop that hasn't executed may cause an exception
      a preceding jump executed OOO can also mispredict
  The "correct" instruction stream is stalled at the RAT
    The RAT was wrongly updated by wrong-path instructions
When the mispredicted branch retires:
  Reset all state in the Out-of-Order engine (RAT, RS, ROB, MOB, etc.)
    Only instructions following the jump are left – they must all be
    flushed
  Reset the RAT to point only to architectural registers
  Un-stall the in-order machine
  The RS gets uops from the RAT and starts scheduling and dispatching
  them

29 Pipeline: Branch Gets to EXE
Pipeline stages: Fetch → Decode → Alloc → Schedule → JEU → Retire
(buffers: IQ, IDQ, RS, ROB)

30 Pipeline: Mispredicted Branch EXE
Flush the front end and re-steer it to the correct path
  The RAT state has already been updated by the wrong path
Block further allocation
Block younger branches from clearing
Update the BPU

31 Pipeline: Mispredicted Branch Retires
When the mispredicted branch retires:
  Flush the OOO
  Allow allocation of uops from the correct path

32 Instant Reclamation
Allows a faster recovery after a jump misprediction:
  uops from the correct path may be allocated and executed before the
  mispredicted jump retires
Every few cycles, take a checkpoint of the RAT
In case of a misprediction:
  Flush the front end and re-steer it to the correct path
  Recover the RAT to the latest checkpoint taken prior to the
  misprediction
  Recover the RAT to its exact state at the misprediction:
    rename 4 uops/cycle from the checkpoint up to the branch
  Flush all uops younger than the branch in the OOO
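The checkpoint/restore mechanism can be sketched as follows. The
checkpoint interval of 4 renames and the dict-based RAT are assumptions
for illustration; the re-renaming of the few uops between the checkpoint
and the branch is only noted in a comment.

```python
class CheckpointedRAT:
    """Sketch: periodic RAT checkpoints for fast misprediction recovery."""
    def __init__(self, interval=4):
        self.rat = {}
        self.checkpoints = []          # list of (uop_seq_no, snapshot)
        self.seq = 0
        self.interval = interval

    def rename_dest(self, arch_reg, rob_entry):
        self.rat[arch_reg] = rob_entry
        self.seq += 1
        if self.seq % self.interval == 0:
            self.checkpoints.append((self.seq, dict(self.rat)))

    def recover(self, branch_seq):
        """Restore the newest checkpoint at or before the branch; the
        real machine then re-renames the few uops up to the branch."""
        while self.checkpoints and self.checkpoints[-1][0] > branch_seq:
            self.checkpoints.pop()     # checkpoints past the branch die
        self.rat = dict(self.checkpoints[-1][1]) if self.checkpoints else {}
        self.seq = branch_seq

r = CheckpointedRAT()
for i in range(10):
    r.rename_dest("r%d" % (i % 3), i)  # renames 0..9, checkpoints at 4, 8
r.recover(6)                           # mispredicted branch was uop #6
assert r.rat == {"r0": 3, "r1": 1, "r2": 2}   # state at checkpoint 4
```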

33 Instant Reclamation: Mispredicted Branch EXE
JEClear is raised on mispredicted macro-branches

34 Instant Reclamation: Mispredicted Branch EXE (2)
JEClear is raised on mispredicted macro-branches:
  Flush the front end and re-steer it to the correct path
  Flush all younger uops in the OOO
  Update the BPU
  Block further allocation

35 Pipeline: Instant Reclamation: EXE
Restore the RAT from the latest checkpoint before the branch
Recover the RAT to its state just after the branch, before any
instruction on the wrong path
Meanwhile the front end starts fetching and decoding instructions from
the correct path

36 Pipeline: Instant Reclamation: EXE (2)
Once the RAT restore is done, allow allocation of uops from the correct
path

37 Large ROB and RS are Important
Large RS:
  Increases the window in which we look for independent instructions
  Exposes more parallelism potential
Large ROB:
  The ROB is a superset of the RS ⇒ ROB size ≥ RS size
  Allows covering long-latency operations (cache miss, divide)
Example:
  Assume a load misses the L1 cache
  The data takes ~10 cycles to return ⇒ ~30 new instructions enter the
  pipeline
  Instructions following the load cannot commit ⇒ they pile up in the
  ROB
  Instructions independent of the load are executed and leave the RS
  As long as the ROB is not full, we can keep executing instructions
  A 40-entry ROB can cover an L1 cache miss, but not an L2 cache miss,
  which takes hundreds of cycles
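The example above is a Little's-law style calculation; the sketch below
restates it, assuming the deck's 3-uop/cycle allocation rate.

```python
# How many ROB entries pile up behind a load while its miss is serviced.
def entries_needed(miss_latency_cycles, alloc_per_cycle=3):
    return miss_latency_cycles * alloc_per_cycle

assert entries_needed(10) == 30     # L1 miss: fits in a 40-entry ROB
assert entries_needed(150) > 40     # L2 miss (assumed ~150 cycles): ROB fills
```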

38 OOO Execution of Memory Operations

39 P6 Caches
Blocking caches severely hurt OOO:
  a cache miss prevents other cache requests (which could be hits) from
  being served
  this defeats one of the main gains of OOO – hiding cache misses
Both the L1 and L2 caches in the P6 are non-blocking:
  they initiate the actions necessary to return data for a cache miss
  while responding to subsequent cached data requests
  they support up to 4 outstanding misses
Misses translate into outstanding requests on the P6 bus:
  the bus can support up to 8 outstanding requests
Subsequent requests for the same missed cache line are squashed:
  squashed requests are not counted in the number of outstanding
  requests
Once the engine has executed beyond the 4 outstanding requests,
subsequent load requests are placed in the load buffer
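The miss-handling policy above can be sketched as a small state machine.
The return values and the set-of-lines representation are assumptions
for illustration.

```python
# Non-blocking cache sketch: a request to an already-missing line is
# squashed (merged) and does not consume an outstanding-request slot;
# beyond 4 outstanding misses, new loads park in the load buffer.
def issue_miss(outstanding, line_addr, max_outstanding=4):
    if line_addr in outstanding:
        return "squashed"              # merged with the in-flight miss
    if len(outstanding) >= max_outstanding:
        return "buffered"              # parked in the load buffer
    outstanding.add(line_addr)
    return "issued"

o = set()
results = [issue_miss(o, a) for a in [0x40, 0x40, 0x80, 0xC0, 0x100, 0x140]]
assert results == ["issued", "squashed", "issued",
                   "issued", "issued", "buffered"]
```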

40 Memory Operations
Loads need to specify:
  the memory address to be accessed
  the width of the data being retrieved
  the destination register
Loads are encoded into a single uop
Stores need to provide a memory address, a data width, and the data to
be written
Stores require two uops:
  one to generate the address (STA)
  one to generate the data (STD)
These uops are scheduled independently to maximize concurrency:
  the address can be calculated before the data to be stored is known
  the two must recombine in the store buffer for the store to complete
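The recombination rule can be shown as a predicate on a store buffer
entry. The field names are assumptions for illustration.

```python
# Sketch: a store as two independently-completing halves (STA and STD)
# that recombine in the store buffer entry.
def store_ready(entry):
    """A store buffer entry may dispatch to memory only once both the
    STA result (address) and the STD result (data) have arrived."""
    return entry["addr"] is not None and entry["data"] is not None

entry = {"addr": None, "data": None}
entry["addr"] = 0x1000          # STA executes first: data not yet known
assert not store_ready(entry)
entry["data"] = 42              # STD arrives later; entry is complete
assert store_ready(entry)
```

Splitting the store lets the address half execute early, so later loads
can be disambiguated against it even while the data is still in flight.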

41 The Memory Execution Unit
As a black box, the MEU is just another execution unit:
  it receives memory reads and writes from the RS
  it returns data back to the RS and to the ROB
Unlike many execution units, the MEU has side effects:
  it may cause a bus request to external memory
The RS dispatches memory uops when all the data used for the address
calculation is ready, to both the MOB and the Address Generation Unit
(AGU)
The AGU computes the linear address:
  Segment-Base + Base-Address + (Scale × Index) + Displacement
It sends the linear address to the MOB, where it is stored with the uop
in the appropriate load buffer or store buffer entry

42 The Memory Execution Unit (cont.)
The RS operates based on register dependencies
It cannot detect memory dependencies, even ones as simple as:
  i1: movl %ebx, -4(%ebp)   # MEM[ebp-4] ← ebx
  i2: movl -4(%ebp), %eax   # eax ← MEM[ebp-4]
The RS may dispatch these operations in any order, so the MEU is in
charge of memory ordering:
  if the MEU blindly executes the above operations out of order, eax may
  get a wrong value
One could require that memory operations dispatch non-speculatively,
but x86 has a small register set and therefore uses memory often:
  Preventing stores from passing stores/loads: 3%~5% perf. loss
    P6 chooses not to allow stores to pass stores/loads
  Preventing loads from passing loads/stores: big performance loss
    Need to allow loads to pass stores, and loads to pass loads
The MEU allows speculative execution as much as possible

43 OOO Execution of Memory Operations
Stores are not executed OOO:
  Stores are never performed speculatively – there is no transparent way
  to undo them
  Stores are also never reordered among themselves
  The store buffer dispatches a store only when the store has both its
  address and its data, and there are no older stores awaiting dispatch
Resolving memory dependencies:
  Some memory dependencies can be resolved statically:
    store r1, [a]
    load  r2, [b]   ⇒ the load can advance before the store
  Problem – some cannot:
    store r1, [r3]
    load  r2, [b]   ⇒ the load must wait until r3 is known

44 Memory Order Buffer (MOB)
Every memory uop is allocated an entry in order, and the entry is
de-allocated when the uop retires
The address (and the data, for stores) is updated when known
A load is checked against all previous stores:
  It waits if a store to the same address exists but its data is not
  ready
  If the store data exists, the load just uses it: store → load
  forwarding
  It waits until all previous store addresses are resolved
  If there is no address collision, it goes to memory
Memory disambiguation:
  The MOB predicts whether a load can proceed despite unknown STAs
  Predicted colliding ⇒ block the load if there is an unknown STA (as
  usual)
  Predicted non-colliding ⇒ execute even if there are unknown STAs
  On a wrong prediction, the entire pipeline is flushed when the load
  retires
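The load check above (without the disambiguation predictor) can be
sketched as a scan over older stores, youngest first. The tuple
representation and result strings are assumptions for illustration.

```python
# Baseline MOB load check: scan older stores from youngest to oldest.
def load_check(load_addr, older_stores):
    """older_stores: (addr_or_None, data_or_None) pairs in program
    order; addr is None while the STA is unresolved."""
    for addr, data in reversed(older_stores):
        if addr is None:
            return "block"             # unknown STA: conservatively wait
        if addr == load_addr:
            if data is not None:
                return "forward"       # store -> load forwarding
            return "wait-for-data"     # address matches, STD pending
    return "go-to-memory"              # no collision with older stores

assert load_check(0x100, [(0x200, 5), (0x100, 7)]) == "forward"
assert load_check(0x100, [(0x300, 1), (None, None)]) == "block"
assert load_check(0x100, [(0x200, 5)]) == "go-to-memory"
```

The disambiguation predictor described above relaxes exactly the
"block" case: a load predicted non-colliding proceeds past the unknown
STA, at the cost of a full pipeline flush if the guess was wrong.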

45 Pipeline: Load: Allocate
Allocate ROB/RS and MOB (load buffer) entries
Assign a Store ID to enable ordering

46 Pipeline: Bypassed Load: EXE
The RS checks when the data used for the address calculation is ready
The AGU calculates the linear address:
  DS-Base + base + (Scale × Index) + Disp.
Write the load into the load buffer
DTLB: virtual → physical translation + DCU set access
The MOB checks for blocking and forwarding
DCU read / Store Data Buffer read (store → load forwarding)
Write back the data / write a block code

47 Pipeline: Blocked Load Re-dispatch
The MOB determines which loads are ready, and schedules one
The load arbitrates for the MEU
DTLB: virtual → physical translation + DCU set access
The MOB checks blocking/forwarding
DCU way select / Store Data Buffer read
Write back the data / write a block code

48 Pipeline: Load: Retire
Reclaim the ROB and LB entries
Commit the results to the RRF

49 DCU Miss
If a load misses in the DCU, the DCU:
  Marks the write-back data as invalid
  Assigns a fill buffer to the load
  Starts the MLC access
When the critical chunk is returned, the DCU writes it back on the WB
bus

50 Pipeline: Store: Allocate
Allocate ROB/RS entries
Allocate a store buffer entry

51 Pipeline: Store: STA EXE
The RS checks when the data used for the address calculation is ready,
and dispatches the STA to the AGU
The AGU calculates the linear address and writes it to the store buffer
DTLB: virtual → physical translation
Load buffer: memory-disambiguation verification
Write the physical address to the store buffer

52 Pipeline: Store: STD EXE
The RS checks when the data for the STD is ready, and dispatches the STD
Write the data to the store buffer

53 Pipeline: Senior Store Retirement
When the STA (and thus the STD) retires, the store buffer entry is
marked as senior
When the DCU is idle, the MOB dispatches the senior store:
  Read the senior entry
  The store buffer sends the data and the physical address
  The DCU writes the data
  Reclaim the SB entry

