
1 Computer Structure 2014 – P6 uArch 1 Computer Structure The P6 Micro-Architecture An Example of an Out-Of-Order Micro-processor Lihu Rappoport and Adi Yoaz

2 Computer Structure 2014 – P6 uArch 2 The P6 Family
• Features
  – Out-of-order execution
  – Register renaming
  – Speculative execution
  – Multiple branch prediction
  – Super pipeline: 12 pipe stages
Processor      Year   Freq (MHz)   Bus (MHz)   L2 cache    Process
Pentium® Pro   1995   150~200      60/66       256/512K*   0.5μ, 0.35μ
Pentium® II    1997   233~450      66/100      512K*       0.35μ, 0.25μ
Pentium® III   1999   450~1400     100/133     256/512K    0.25μ, 0.18μ, 0.13μ
Pentium® M     2003   900~2260     400/533     1M / 2M     0.13μ, 90nm
Core™          2005   1660~2330    667         2M          65nm
Core™ 2        2006   1800~2930    800/1066    2/4/8M      65nm, 45nm
* off die

3 Computer Structure 2014 – P6 uArch 3 P6 uArch
• In-Order Front End
  – BIU: Bus Interface Unit
  – IFU: Instruction Fetch Unit (includes IC)
  – BPU: Branch Prediction Unit
  – ID: Instruction Decoder
  – MS: Micro-Instruction Sequencer
  – RAT: Register Alias Table
• Out-of-order Core
  – ROB: Reorder Buffer
  – RRF: Real Register File
  – RS: Reservation Stations
  – IEU: Integer Execution Unit
  – FEU: Floating-point Execution Unit
  – AGU: Address Generation Unit
  – MIU: Memory Interface Unit
  – DCU: Data Cache Unit
  – MOB: Memory Order Buffer
  – L2: Level 2 cache
• In-Order Retire

4 Computer Structure 2014 – P6 uArch 4 P6 Pipeline
• In-Order Front End + rename/alloc
  – I1: Next IP
  – I2: ICache lookup
  – I3: ILD (instruction length decode)
  – I4: Steer the instruction bytes to the decoders
  – I5: ID1 – decode the instructions
  – I6: ID2 – decode the instructions
  – I7: RAT – rename sources; ALLOC – assign destinations
  – I8: ROB – read sources; RS – schedule data-ready uops for dispatch
• Out-of-order Core
  – O1: RS – dispatch uops
  – O2: EX
• In-order Retirement
  – R1: Retirement
  – R2: Retirement
[Figure: pipeline diagram with stages Next IP, Icache, Decode, Reg Ren, RS Wr, RS disp, Ex, Retirement]

5 Computer Structure 2014 – P6 uArch 5 In-Order Front End
• BPU – Branch Prediction Unit – predicts the next fetch address
• IFU – Instruction Fetch Unit
  – iTLB translates virtual to physical address (accesses PMH on miss)
  – IC supplies 16 bytes/cycle (accesses L2 cache on miss)
• ILD – Instruction Length Decode – splits bytes into instructions
• IQ – Instruction Queue – buffers the instructions
• ID – Instruction Decode – decodes instructions into uops
• MS – Micro-Sequencer – provides uops for complex instructions
[Figure: front-end block diagram with Next IP Mux, BPU, IFU, ILD, IQ, ID, MS, IDQ; the data flows as bytes, then instructions, then uops]

6 Computer Structure 2014 – P6 uArch 6 Branch Prediction
• Need to provide predictions for the entire fetch line each cycle
  – Predict the first taken branch in the line, following the fetch IP
• Implemented by
  – Splitting IP into offset within line, set, and tag
  – If the tag of more than one way matches the fetch IP
    • The offsets of the matching ways are ordered
    • Ways with offset smaller than the fetch IP offset are discarded
    • The first branch that is predicted taken is chosen as the predicted branch
[Figure: a fetch line with a jump into the line, branches marked predict not taken / predict taken, and a jump out of the line]
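
The selection rule on this slide can be sketched in a few lines of Python. This is only an illustration of the ordering and filtering steps listed above, not the actual hardware logic; the `BtbWay` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BtbWay:
    """Hypothetical view of one BTB way after the set lookup."""
    tag_match: bool        # tag of this way matches the fetch IP
    offset: int            # offset of the branch within the fetch line
    predicted_taken: bool  # direction prediction for this branch
    target: int            # predicted target address


def select_predicted_branch(ways: List[BtbWay], fetch_offset: int) -> Optional[BtbWay]:
    # Keep only matching ways whose branch is at or after the fetch IP offset.
    candidates = [w for w in ways if w.tag_match and w.offset >= fetch_offset]
    # Order by offset and pick the first branch that is predicted taken.
    for way in sorted(candidates, key=lambda w: w.offset):
        if way.predicted_taken:
            return way
    return None  # no taken branch predicted: fetch continues sequentially
```

For example, with branches at offsets 2, 6 and 9 and a fetch IP offset of 4, the way at offset 2 is discarded, the not-taken branch at offset 6 is skipped, and the branch at offset 9 is selected.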

7 Computer Structure 2014 – P6 uArch 7 The P6 BTB
• 512 entries in 128 sets × 4 ways
  – Up to 4 branches can have a tag match
• Each entry holds a branch target and a 4-bit local branch history
  – The 4 histories in each set all point to a shared 16 entry 2-bit counter array
• Branch type: 00 – cond, 01 – ret, 10 – call, 11 – uncond
• Prediction = msb of counter
[Figure: BTB organization showing, per way, the Tag, Target, Hist and branch type fields, the per-set counter array and LRR bits, and the Return Stack Buffer]
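
A minimal Python sketch of one BTB set's direction predictor, as described on this slide: four 4-bit local histories indexing a shared array of sixteen 2-bit counters, with the prediction taken from the counter's most significant bit. The class name, the initial counter value, and the saturating update rule are assumptions; the slide only states the structure and the "msb of counter" rule.

```python
class BtbSetPredictor:
    """Direction predictor of one BTB set (illustrative sketch)."""

    def __init__(self) -> None:
        self.history = [0, 0, 0, 0]  # one 4-bit local history per way
        self.counters = [1] * 16     # shared 2-bit counters (initial value assumed)

    def predict(self, way: int) -> bool:
        # Prediction = msb of the 2-bit counter selected by the way's history.
        return self.counters[self.history[way]] >= 2

    def update(self, way: int, taken: bool) -> None:
        index = self.history[way]
        counter = self.counters[index]
        # Saturating increment/decrement (assumed update policy).
        self.counters[index] = min(3, counter + 1) if taken else max(0, counter - 1)
        # Shift the outcome into the 4-bit history.
        self.history[way] = ((index << 1) | int(taken)) & 0xF
```

With a local history, a strictly alternating taken/not-taken branch becomes predictable after a short warm-up, because each history pattern selects its own counter.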

8 Computer Structure 2014 – P6 uArch 8 In-Order Front End: Decoder
• Instruction Length Decode: determines where each IA instruction starts
  – Receives 16 instruction bytes from the IFU
• IQ: buffers instructions
• Decoders D0–D3: convert instructions into uops
  – D0 produces ≤4 uops; D1, D2 and D3 produce 1 uop each
  – If an instruction aligned with decoder 1/2/3 decodes into more than 1 uop, defer it to the next cycle
• IDQ: buffers uops
  – Smooths the decoders’ variable throughput

9 Computer Structure 2014 – P6 uArch 9 Micro Operations (Uops)
• Each CISC inst is broken into one or more RISC uops
  – Each uop is (relatively) simple
  – Canonical representation of src/dest (2 src, 1 dest)
  – Increased ILP
    • e.g., pop eax becomes esp1<-esp0+4, eax1<-[esp0]
• Simple instructions translate to a few uops
  – Typical uop count (it is not necessarily cycle count!)
    Reg-Reg ALU/Mov inst           1 uop
    Mem-Reg Mov (load)             1 uop
    Mem-Reg ALU (load + op)        2 uops
    Reg-Mem Mov (store)            2 uops (st addr, st data)
    Reg-Mem ALU (ld + op + st)     4 uops
• Complex instructions need ucode

10 Computer Structure 2014 – P6 uArch 10 Out-of-order Core
• Reorder Buffer (ROB):
  – Holds “not yet retired” instructions
  – 40 entries, in-order
• Reservation Stations (RS):
  – Holds “not yet executed” instructions
  – 20 entries
• Execution Units
  – IEU: Integer Execution Unit
  – FEU: Floating-point Execution Unit
• Memory related units
  – AGU: Address Generation Unit
  – MIU: Memory Interface Unit
  – DCU: Data Cache Unit
  – MOB: Orders Memory operations
  – L2: Level 2 cache

11 Computer Structure 2014 – P6 uArch 11 Alloc & RAT
• Perform register allocation and renaming for ≤4 uops/cyc
• The Register Alias Table (RAT)
  – Maps architectural registers into physical registers
    • For each arch reg, holds the number of the latest phy reg that updates it
  – When a new uop that writes to an arch reg R is allocated
    • Record the phy reg allocated to the uop as the latest reg that updates R
• The Allocator (Alloc)
  – Assigns each uop an entry number in the ROB / RS
  – For each one of the sources (architectural registers) of the uop
    • Look up the RAT to find the latest phy reg updating it
    • Write it in the RS entry
  – Allocate Load & Store buffers in the MOB
Arch reg   #reg   Location
EAX        0      RRF
EBX        19     ROB
ECX        23     ROB
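
The RAT update and lookup steps above can be sketched as follows. This is a simplified single-uop model under stated assumptions: the `Renamer` class, its method names, and the tuple encoding of locations are hypothetical, and the starting ROB entry (37) is chosen to match the renaming example later in the deck.

```python
from typing import Dict, List, Tuple

Location = Tuple[str, int]  # ("RRF", n) or ("ROB", n)


class Renamer:
    """Sketch of RAT lookup plus allocation of one ROB entry per uop."""

    def __init__(self, rat: Dict[str, Location], next_rob: int, rob_size: int = 40) -> None:
        self.rat = dict(rat)
        self.next_rob = next_rob
        self.rob_size = rob_size

    def rename(self, op: str, dst: str, srcs: List[str]):
        # Sources are looked up before the destination mapping is updated,
        # so "add EAX, EBX, EAX" reads the old EAX mapping.
        src_locations = [self.rat[s] for s in srcs]
        pdst = self.next_rob
        self.next_rob = (self.next_rob + 1) % self.rob_size
        # Record the new physical register as the latest writer of dst.
        self.rat[dst] = ("ROB", pdst)
        return op, ("ROB", pdst), src_locations


# RAT contents taken from the table on this slide.
renamer = Renamer({"EAX": ("RRF", 0), "EBX": ("ROB", 19), "ECX": ("ROB", 23)}, next_rob=37)
add_uop = renamer.rename("add", "EAX", ["EBX", "EAX"])
sub_uop = renamer.rename("sub", "EAX", ["ECX", "EAX"])
```

Here `add_uop` reads EBX from ROB19 and EAX from RRF0 and writes ROB37, while `sub_uop` picks up ROB37 as its EAX source and writes ROB38.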

12 Computer Structure 2014 – P6 uArch 12 Re-order Buffer (ROB)
• Holds 40 uops which are not yet committed
  – In the same order as in the program
• Provides a large physical register space for register renaming
  – One physical register per ROB entry
    • physical register number = entry number
    • Each uop has only one destination
  – Buffers the execution results until retirement
  – Data Valid is set after the uop has executed and its result was written to the physical reg
#entry   Entry Valid   Data Valid   Physical Reg Data   Architectural dest. reg
0        1             1            12H                 EBX
1        1             1            33H                 ECX
2        1             0            xxx                 ESI
39       0             0            xxx                 XXX

13 Computer Structure 2014 – P6 uArch 13 RRF – Real Register File
• Holds the Architectural Register File
  – Architectural registers are numbered: 0 – EAX, 1 – EBX, …
• The value of an architectural register
  – is the value written to it by the last committed instruction that writes to this register
RRF:
#entry (Arch Reg)   Data
0 (EAX)             9AH
1 (EBX)             F34H

14 Computer Structure 2014 – P6 uArch 14 Uop flow through the ROB
• Uops are entered in order
  – Registers renamed by the entry number
• Once assigned: execution order unimportant
• After execution
  – Entries marked “executed” and wait for retirement
  – Executed entry can be “retired” once all prior instructions have retired
  – Commit architectural state only after speculation (branch, exception) has resolved
• Retirement
  – Detect exceptions and mispredictions
    • Initiate repair to get machine back on right track
  – Update “real registers” with value of renamed registers
  – Update memory
  – Leave the ROB

15 Computer Structure 2014 – P6 uArch 15 Reservation station (RS)
• Pool of all “not yet executed” uops
  – Holds the uop attributes and the uop source data until it is dispatched
• When a uop is allocated in RS, operand values are updated
  – If operand is from an architectural register, value is taken from the RRF
  – If operand is from a phy reg, with data valid set, value taken from ROB
  – If operand is from a phy reg, with data valid not set, wait for value
• The RS maintains operands status “ready/not-ready”
  – Each cycle, executed uops make more operands “ready”
    • The RS arbitrates the WB busses between the units
    • The RS monitors the WB bus to capture data needed by awaiting uops
    • Data can be bypassed directly from WB bus to execution unit
  – Uops whose operands are all ready can be dispatched for execution
    • Dispatcher chooses which of the ready uops to execute next
    • Dispatches chosen uops to functional units
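
The ready/not-ready bookkeeping described above can be illustrated with a small Python model. It is a sketch under stated assumptions: the class and method names are hypothetical, a source is either a captured value or a tag naming the ROB entry it waits for, the dispatcher simply picks the oldest ready uops, and the dispatch width is a parameter rather than a modeled port constraint.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RsEntry:
    op: str
    pdst: int    # ROB entry that receives the result
    srcs: list   # each source is ["ready", value] or ["wait", rob_entry]

    def ready(self) -> bool:
        return all(kind == "ready" for kind, _ in self.srcs)


class ReservationStation:
    def __init__(self) -> None:
        self.entries: List[RsEntry] = []

    def allocate(self, entry: RsEntry) -> None:
        self.entries.append(entry)

    def writeback(self, pdst: int, value: int) -> None:
        # Snoop the write-back bus: capture the value in every waiting source.
        for entry in self.entries:
            for src in entry.srcs:
                if src[0] == "wait" and src[1] == pdst:
                    src[0], src[1] = "ready", value

    def dispatch(self, max_uops: int = 1) -> List[RsEntry]:
        # Oldest-first selection among ready uops (assumed policy).
        chosen = [e for e in self.entries if e.ready()][:max_uops]
        self.entries = [e for e in self.entries if all(e is not c for c in chosen)]
        return chosen


rs = ReservationStation()
rs.allocate(RsEntry("add", 37, [["ready", 0x97], ["ready", 0x12]]))
rs.allocate(RsEntry("sub", 38, [["wait", 37], ["ready", 0x33]]))
first = rs.dispatch()           # only the add is ready
stalled = rs.dispatch()         # the sub still waits for ROB entry 37
rs.writeback(37, 0x97 + 0x12)   # the add's result appears on the WB bus
second = rs.dispatch()          # now the sub can go
```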

16 Computer Structure 2014 – P6 uArch 16 Register Renaming example
• Uop from IDQ: Add EAX, EBX, EAX
• After RAT / Alloc: Add ROB37, ROB19, RRF0
RAT before:   EAX → RRF 0    EBX → ROB 19   ECX → ROB 23
RAT after:    EAX → ROB 37   EBX → ROB 19   ECX → ROB 23
RRF:
#   Reg   Data
0   EAX   97H
ROB before:
#    Entry Valid   Data Valid   Data   DST
19   V             V            12H    EBX
23   V             V            33H    ECX
37   I             x            xxx    XXX
38   I             x            xxx    XXX
ROB after:
#    Entry Valid   Data Valid   Data   DST
19   V             V            12H    EBX
23   V             V            33H    ECX
37   V             I            xxx    EAX
38   I             x            xxx    XXX
RS:
op    v   src1   v   src2   Pdst
add   1   97H    1   12H    37

17 Computer Structure 2014 – P6 uArch 17 Register Renaming example (2)
• Uop from IDQ: sub EAX, ECX, EAX
• After RAT / Alloc: sub ROB38, ROB23, ROB37
RAT before:   EAX → ROB 37   EBX → ROB 19   ECX → ROB 23
RAT after:    EAX → ROB 38   EBX → ROB 19   ECX → ROB 23
RRF:
#   Reg   Data
0   EAX   97H
ROB before:
#    Entry Valid   Data Valid   Data   DST
19   V             V            12H    EBX
23   V             V            33H    ECX
37   V             I            xxx    EAX
38   I             x            xxx    XXX
ROB after:
#    Entry Valid   Data Valid   Data   DST
19   V             V            12H    EBX
23   V             V            33H    ECX
37   V             I            xxx    EAX
38   V             I            xxx    EAX
RS:
op    v   src1    v   src2   Pdst
add   1   97H     1   12H    37
sub   0   rob37   1   33H    38

18 Computer Structure 2014 – P6 uArch 18 Out-of-order Core: Execution Units
• Port 0: SHF, FMU, FDIV, IDIV, FAU, IEU
• Port 1: JEU, IEU
• Port 2: AGU (load address)
• Ports 3, 4: AGU (store address)
• Bypasses: internal 0-delay bypass within each EU; 1st bypass in MIU; 2nd bypass in RS
[Figure: RS dispatching through ports 0–4 to the execution units, with the MIU, DCU and SDB]

19 Computer Structure 2014 – P6 uArch 19 In-Order Retire
• ROB:
  – Retires up to 4 uops per clock
  – Copies the values to the RRF
  – Retirement is done in order
  – Performs exception checking

20 Computer Structure 2014 – P6 uArch 20 In-order Retirement
• The process of committing the results to the architectural state of the processor
• Retire up to 4 uops per clock
• Copy the values to the RRF
• Retirement is done in order
• Perform exception checking
• An instruction is retired after the following checks
  – Instruction has executed
  – All previous instructions have retired
  – Instruction isn’t mispredicted
  – No exceptions

21 Computer Structure 2014 – P6 uArch 21 Pipeline: Fetch
• Fetch 16B from I$
• Length-decode instructions within 16B
• Write instructions into IQ
[Figure: pipeline diagram (Predict/Fetch, Decode, IQ, IDQ, Alloc, ROB, RS, Schedule, EX, Retire)]

22 Computer Structure 2014 – P6 uArch 22 Pipeline: Decode
• Read 4 instructions from IQ
• Translate instructions into uops
  – Asymmetric decoders (4-1-1-1)
• Write resulting uops into IDQ
[Figure: pipeline diagram (Predict/Fetch, Decode, IQ, IDQ, Alloc, ROB, RS, Schedule, EX, Retire)]

23 Computer Structure 2014 – P6 uArch 23 Pipeline: Allocate
• Allocate, port bind and rename 4 uops
• Allocate ROB/RS entry per uop
  – If source data is available from ROB or RRF, write data to RS
  – Otherwise, mark data not ready in RS
[Figure: pipeline diagram (Predict/Fetch, Decode, IQ, IDQ, Alloc, ROB, RS, Schedule, EX, Retire)]

24 Computer Structure 2014 – P6 uArch 24 Pipeline: EXE
• Ready/Schedule
  – Check for data-ready uops whose needed functional unit is available
  – Select and dispatch ≤6 ready uops/clock to EXE
  – Reclaim RS entries
• Write back results into RS/ROB
  – Write results into result buffer
  – Snoop write-back ports for results that are sources to uops in RS
  – Update data-ready status of these uops in the RS
[Figure: pipeline diagram (Predict/Fetch, Decode, IQ, IDQ, Alloc, ROB, RS, Schedule, EX, Retire)]

25 Computer Structure 2014 – P6 uArch 25 Pipeline: Retire
• Retire ≤4 oldest uops in ROB
  – Uop may retire if
    • its ready bit is set
    • it does not cause an exception
    • all preceding candidates are eligible for retirement
  – Commit results from result buffer to RRF
  – Reclaim ROB entry
• In case of exception
  – Nuke and restart
[Figure: pipeline diagram (Predict/Fetch, Decode, IQ, IDQ, Alloc, ROB, RS, Schedule, EX, Retire)]
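
The retirement rules above can be sketched as a loop over the head of the ROB. This is a simplified model, not the real retirement logic: the `RobEntry` fields and the `retire` function are hypothetical, every uop is assumed to write one architectural register, and "nuke" is modeled as simply discarding all in-flight uops.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Tuple


@dataclass
class RobEntry:
    dest: str                # architectural destination register
    value: int               # result held in the ROB until retirement
    executed: bool = False   # the "ready" bit
    exception: bool = False


def retire(rob: Deque[RobEntry], rrf: Dict[str, int], width: int = 4) -> Tuple[int, bool]:
    """Retire up to `width` oldest uops in order; returns (retired count, nuked?)."""
    retired = 0
    while rob and retired < width:
        head = rob[0]
        if not head.executed:
            break                  # younger uops must wait behind the oldest one
        if head.exception:
            rob.clear()            # nuke: discard all in-flight uops, then restart
            return retired, True
        rrf[head.dest] = head.value  # commit the result to the RRF
        rob.popleft()                # reclaim the ROB entry
        retired += 1
    return retired, False
```

Because the loop stops at the first unexecuted uop, an already executed younger uop stays in the ROB until everything older has retired.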

26 Computer Structure 2014 – P6 uArch 26 Jump Misprediction – Flush at Execute
• When the JEU detects a jump misprediction, it
  – Flushes the in-order front-end
  – Instructions already in the OOO part continue to execute
    • Including instructions following the wrong jump, which take execution resources and waste power, but will never be committed
  – Starts fetching and decoding from the “correct” path
    • The “correct” path may still be wrong
      – A preceding uop that hasn’t executed may cause an exception
      – A preceding jump executed OOO can also mispredict
  – The “correct” instruction stream is stalled at the RAT
    • The RAT was also wrongly updated by wrong-path instructions
• When the mispredicted branch retires
  – Resets all state in the Out-of-Order Engine (RAT, RS, ROB, MOB, etc.)
    • Only instructions following the jump are left – they must all be flushed
    • Reset the RAT to point only to architectural registers
  – Un-stalls the in-order machine
  – RS gets uops from RAT and starts scheduling and dispatching them

27 Computer Structure 2014 – P6 uArch 27 Pipeline: Branch gets to EXE
[Figure: pipeline diagram (Fetch, Decode, IQ, IDQ, Alloc, ROB, RS, Schedule, JEU, Retire)]

28 Computer Structure 2014 – P6 uArch 28 Pipeline: Mispredicted Branch EXE
• Flush front-end and re-steer it to correct path
• RAT state already updated by wrong path
  – Block further allocation
• Update BPU
• OOO not flushed: instructions already in the OOO continue to execute
  – Including instructions following the wrong jump, which take execution resources and waste power, but will never be committed
• Block younger branches from clearing
[Figure: pipeline diagram (Fetch, Decode, IQ, IDQ, Alloc, ROB, RS, Schedule, JEU, Retire) with a Flush signal]

29 Computer Structure 2014 – P6 uArch 29 Pipeline: Mispredicted Branch Retires
• When mispredicted branch retires
  – Flush OOO
    • Only instructions following the jump are left – they must all be flushed
    • Resets all state in the OOO (RAT, RS, ROB, MOB, etc.)
    • Reset the RAT to point only to architectural registers
  – Allow allocation of uops from correct path
[Figure: pipeline diagram (Fetch, Decode, IQ, IDQ, Alloc, ROB, RS, Schedule, JEU, Retire) with a Clear signal]

30 Computer Structure 2014 – P6 uArch 30 Instant Reclamation
• Allow a faster recovery after jump misprediction
  – Allow execution/allocation of uops from correct path before mispredicted jump retires
• Every few cycles take a checkpoint of the RAT
• In case of misprediction
  – Flush the frontend and re-steer it to the correct path
  – Recover RAT to latest checkpoint taken prior to misprediction
  – Recover RAT to exact state at misprediction
    • Rename 4 uops/cycle from the checkpoint up to the branch
  – Flush all uops younger than the branch in the OOO
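
The two-step RAT recovery (restore the latest checkpoint before the branch, then re-rename forward to the branch) can be sketched like this. The data layout is an assumption made for illustration: `checkpoints` maps a uop index to the RAT snapshot taken just before that uop was renamed, and `uop_log` records, in allocation order, each uop's architectural destination (or `None`) and its ROB entry.

```python
from typing import Dict, List, Optional, Tuple

Rat = Dict[str, Tuple[str, int]]


def recover_rat(checkpoints: Dict[int, Rat],
                uop_log: List[Tuple[Optional[str], int]],
                branch_idx: int) -> Rat:
    """Return the RAT state just after the mispredicted branch was renamed."""
    # Step 1: restore the latest checkpoint taken at or before the branch.
    base = max(i for i in checkpoints if i <= branch_idx)
    rat = dict(checkpoints[base])
    # Step 2: replay the renames from the checkpoint up to and including the
    # branch (the hardware does this a few uops per cycle).
    for dst, rob_entry in uop_log[base:branch_idx + 1]:
        if dst is not None:
            rat[dst] = ("ROB", rob_entry)
    return rat


initial = {"EAX": ("RRF", 0), "EBX": ("RRF", 1)}
log = [("EAX", 10), ("EBX", 11), ("EAX", 12), (None, 13), ("EBX", 14)]
snapshots = {0: initial, 2: {"EAX": ("ROB", 10), "EBX": ("ROB", 11)}}
# Uop 3 is the mispredicted branch; uop 4 is on the wrong path.
recovered = recover_rat(snapshots, log, branch_idx=3)
```

The recovered table maps EAX to ROB12 and EBX to ROB11: the wrong-path write to EBX (ROB14) is gone, while every rename older than the branch is preserved.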

31 Computer Structure 2014 – P6 uArch 31 Instant Reclamation: Mispredicted Branch EXE
• JEClear raised on mispredicted macro-branches
[Figure: pipeline diagram (Predict/Fetch, Decode, IQ, IDQ, Alloc, ROB, RS, Schedule, JEU, Retire) with a Clear signal]

32 Computer Structure 2014 – P6 uArch 32 Instant Reclamation: Mispredicted Branch EXE
• JEClear raised on mispredicted macro-branches
  – Flush frontend and re-steer it to the correct path
  – Flush all younger uops in OOO
  – Update BPU
  – Block further allocation
[Figure: pipeline diagram (Predict/Fetch, Decode, IQ, IDQ, Alloc, ROB, RS, Schedule, JEU, Retire) with Clear and BPU Update signals]

33 Computer Structure 2014 – P6 uArch 33 Pipeline: Instant Reclamation: EXE
• Restore RAT from latest check-point before branch
• Recover RAT to its state just after the branch
  – Before any instruction on the wrong path
• Meanwhile front-end starts fetching and decoding instructions from the correct path
[Figure: pipeline diagram (Predict/Fetch, Decode, IQ, IDQ, Alloc, ROB, RS, Schedule, JEU, Retire)]

34 Computer Structure 2014 – P6 uArch 34 Pipeline: Instant Reclamation: EXE
• Once done restoring the RAT
  – Allow allocation of uops from correct path
[Figure: pipeline diagram (Predict/Fetch, Decode, IQ, IDQ, Alloc, ROB, RS, Schedule, JEU, Retire)]

35 Computer Structure 2014 – P6 uArch 35 Large ROB and RS are Important
• Large RS
  – Increases the window in which to look for independent instructions
    • Exposes more parallelism potential
• Large ROB
  – The ROB is a superset of the RS ⇒ ROB size ≥ RS size
  – Allows covering long-latency operations (cache miss, divide)
• Example
  – Assume there is a Load that misses the L1 cache
    • Data takes ~10 cycles to return ⇒ ~30 new instrs get into pipeline
  – Instructions following the Load cannot commit ⇒ pile up in the ROB
  – Instructions independent of the load are executed, and leave the RS
    • As long as the ROB is not full, we can keep executing instructions
  – A 40 entry ROB can cover for an L1 cache miss
    • Cannot cover for an LLC miss, which is hundreds of cycles
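
The arithmetic behind this example is simple enough to write down. The sketch below assumes a steady allocation rate of 3 uops per cycle, which is what the slide's "~10 cycles ⇒ ~30 new instrs" implies; the function names and the 200-cycle figure used for an LLC miss are illustrative assumptions (the slide only says "hundreds of cycles").

```python
def uops_in_flight(latency_cycles: int, alloc_per_cycle: int) -> int:
    """Uops that enter the ROB while a load of the given latency is outstanding."""
    return latency_cycles * alloc_per_cycle


def rob_covers(latency_cycles: int, alloc_per_cycle: int, rob_entries: int = 40) -> bool:
    """True if the ROB does not fill up before the load's data returns."""
    return uops_in_flight(latency_cycles, alloc_per_cycle) <= rob_entries


l1_miss_covered = rob_covers(latency_cycles=10, alloc_per_cycle=3)    # 30 <= 40
llc_miss_covered = rob_covers(latency_cycles=200, alloc_per_cycle=3)  # 600 > 40
```

Under these assumptions the 40-entry ROB fills after about 13 cycles, which is enough for the ~10-cycle L1 miss but far short of an LLC miss.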

36 Computer Structure 2014 – P6 uArch 36 OOO Execution of Memory Operations

37 Computer Structure 2014 – P6 uArch 37 P6 Caches
• Blocking caches severely hurt OOO
  – A cache miss prevents other cache requests (which could possibly be hits) from being served
  – Hurts one of the main gains from OOO – hiding cache misses
• Both L1 and L2 cache in the P6 are non-blocking
  – Initiate the actions necessary to return data for a cache miss while they respond to subsequent cached data requests
  – Support up to 4 outstanding misses
    • Misses translate into outstanding requests on the P6 bus
    • The bus can support up to 8 outstanding requests
• Squash subsequent requests for the same missed cache line
  – Squashed requests not counted in number of outstanding requests
  – Once the engine has executed beyond the 4 outstanding requests
    • subsequent load requests are placed in the load buffer
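
A small Python model of the miss-handling policy on this slide: up to 4 outstanding misses, requests to an already-missing line are squashed and not counted, and further misses wait in the load buffer. The `MissTracker` class, its return strings, and the 32-byte line size are assumptions made for the sketch.

```python
class MissTracker:
    """Tracks outstanding cache-line misses of a non-blocking cache (sketch)."""

    def __init__(self, max_outstanding: int = 4, line_bytes: int = 32) -> None:
        self.max_outstanding = max_outstanding
        self.line_bytes = line_bytes   # line size is an assumption
        self.pending = set()           # line numbers with a miss in flight

    def request_miss(self, addr: int) -> str:
        line = addr // self.line_bytes
        if line in self.pending:
            return "squashed"          # same line already requested; not counted
        if len(self.pending) >= self.max_outstanding:
            return "buffered"          # waits in the load buffer
        self.pending.add(line)
        return "issued"                # becomes an outstanding bus request

    def fill(self, addr: int) -> None:
        # Data for this line returned; free its slot.
        self.pending.discard(addr // self.line_bytes)


tracker = MissTracker()
issued = [tracker.request_miss(a) for a in (0x000, 0x100, 0x200, 0x300)]
same_line = tracker.request_miss(0x104)   # same line as 0x100
fifth = tracker.request_miss(0x400)       # a fifth distinct line
tracker.fill(0x100)
retry = tracker.request_miss(0x400)       # a slot is free again
```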

38 Computer Structure 2014 – P6 uArch 38 OOO Execution of Memory Operations
• The RS operates based on register dependencies
  – RS cannot detect memory dependencies
      movl %ebx, -4(%ebp)   # MEM[ebp-4] ← ebx
      movl -4(%ebp), %eax   # eax ← MEM[ebp-4]
  – RS dispatches memory uops when data for address calculation is ready, and the MOB and Address Generation Unit (AGU) are free
  – AGU computes the linear address: Segment-Base + Base-Address + (Scale*Index) + Displacement
    • Sends linear address to MOB, to be stored in Load Buffer or Store Buffer
• MOB resolves memory dependencies and enforces memory ordering
  – Some memory dependencies can be resolved statically
      store r1,a
      load r2,b       ⇒ can advance load before store
  – Problem: some cannot
      store r1,[r3]
      load r2,b       ⇒ load must wait till r3 is known
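
The AGU formula above translates directly into code. The function below is an illustrative sketch: its name and parameters are hypothetical, and it assumes 32-bit addressing with wrap-around and the x86 scale factors 1, 2, 4 and 8.

```python
def linear_address(segment_base: int, base: int, index: int,
                   scale: int, displacement: int) -> int:
    """Segment-Base + Base-Address + (Scale*Index) + Displacement, modulo 2**32."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 scale factor must be 1, 2, 4 or 8")
    return (segment_base + base + scale * index + displacement) & 0xFFFFFFFF


# -4(%ebp) with ebp = 0x1000 and a flat (zero-based) segment:
frame_slot = linear_address(segment_base=0, base=0x1000, index=0, scale=1, displacement=-4)
# An array element: base + 4*index + 8
element = linear_address(segment_base=0, base=0x2000, index=3, scale=4, displacement=8)
```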

39 Computer Structure 2014 – P6 uArch 39 Load and Store Ordering
• x86 has a small register set ⇒ uses memory often
  – Preventing Stores from passing Stores/Loads: 3%~5% perf. loss
    • P6 chooses not to allow Stores to pass Stores/Loads
  – Preventing Loads from passing Loads/Stores: big perf. loss
    • P6 allows Loads to pass Stores, and Loads to pass Loads
• Stores are not executed OOO
  – Stores are never performed speculatively
    • There is no transparent way to undo them
  – Stores are also never re-ordered among themselves
    • The Store Buffer dispatches a store only when the store has both its address and its data, and there are no older stores awaiting dispatch
  – Store commits its write to memory (DCU) at retirement

40 Computer Structure 2014 – P6 uArch 40 Store Implemented as 2 Uops
• Store decoded as two independent uops
  – STA (store-address): calculates the address of the store
  – STD (store-data): stores the data into the Store Data buffer
    • The actual write to memory is done when the store retires
• Separating STA & STD is important for memory OOO
  – Allows STA to dispatch earlier, even before the data is known
  – Address conflicts resolved earlier ⇒ opens memory pipeline for other loads
• STA and STD can be issued to execution units in parallel
  – STA dispatched to AGU when its sources (base+index) are ready
  – STD dispatched to SDB when its source operand is available

41 Computer Structure 2014 – P6 uArch 41 Memory Order Buffer (MOB)
• Store Coloring
  – Each Store allocated in-order in Store Buffer, and gets a SBID
  – Each load allocated in-order in Load Buffer, and gets LBID + current SBID
• Load is checked against all previous stores
  – Stores with SBID ≤ load’s SBID
• Load blocked if
  – Unresolved address of a relevant STA
  – STA to same address, but data not ready
  – Missing resources (DTLB miss, DCU miss)
• MOB writes blocking info into load buffer
  – Re-dispatches load when wake-up signal received
• If Load is not blocked ⇒ executed (bypassed)
Op      LBID   SBID
Store   -      0
Store   -      1
Load    0      1
Store   -      2
Load    1      2
Load    2      2
Load    3      2
Store   -      3
Load    4      3
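
The blocking check enabled by store coloring can be sketched as below. It is a simplified model under stated assumptions: the `StoreEntry` record and the result strings are hypothetical, addresses are compared for exact equality (no partial overlap), SBIDs do not wrap around, and the resource conditions (DTLB miss, DCU miss) are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoreEntry:
    sbid: int
    address: Optional[int] = None  # None until the STA has executed
    data_ready: bool = False       # set once the STD has executed


def check_load(load_sbid: int, load_addr: int, store_buffer: List[StoreEntry]) -> str:
    """Decide whether a load may execute, given the older stores in the buffer."""
    for store in store_buffer:
        if store.sbid > load_sbid:
            continue  # younger than the load: irrelevant
        if store.address is None:
            return "blocked_unknown_sta"
        if store.address == load_addr and not store.data_ready:
            return "blocked_data_not_ready"
    return "proceed"


stores = [StoreEntry(0, 0x100, True), StoreEntry(1, None), StoreEntry(2, 0x200, False)]
early_load = check_load(load_sbid=0, load_addr=0x100, store_buffer=stores)
later_load = check_load(load_sbid=1, load_addr=0x300, store_buffer=stores)
```

Here `early_load` may proceed because only store 0 is older and it is fully resolved, while `later_load` is blocked by store 1, whose address is still unknown.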

42 Computer Structure 2014 – P6 uArch 42 MOB (Cont.)
• If a Load misses in the DCU
  – The DCU marks the write-back data as invalid
  – Assigns a fill buffer to the load, and issues an L2 request
  – When critical chunk is returned, wakeup and re-dispatch the load
• Store → Load Forwarding
  – Older STA with same address as load and data ready ⇒ Load gets its data directly from the SB (no DCU access)
• Memory Disambiguation
  – MOB predicts if a load can proceed despite unknown STAs
    • Predict colliding ⇒ block Load if there is unknown STA (as usual)
    • Predict non colliding ⇒ execute even if there are unknown STAs
  – In case of wrong prediction
    • The entire pipeline is flushed when the load retires

43 Computer Structure 2014 – P6 uArch 43 Pipeline: Load: Allocate
• Allocate ROB/RS, MOB entries
• Assign Store Buffer ID (SBID) to enable ordering
[Figure: load pipeline diagram (IDQ, Alloc, ROB, RS, Schedule, AGU, LB Write, DTLB, DCU, WB, MOB, LB, Retire)]

44 Computer Structure 2014 – P6 uArch 44 Pipeline: Bypassed Load: EXE
• RS checks when data used for address calculation is ready
• AGU calculates linear address: DS-Base + base + (Scale*Index) + Disp.
• Write load into Load Buffer
• DTLB Virtual → Physical + DCU set access
• MOB checks blocking and forwarding
• DCU read / Store Data Buffer read (Store → Load forwarding)
• Write back data / write block code
[Figure: load pipeline diagram (IDQ, Alloc, ROB, RS, Schedule, AGU, LB Write, DTLB, DCU, WB, MOB, LB, Retire)]

45 Computer Structure 2014 – P6 uArch 45 Pipeline: Blocked Load Re-dispatch
• MOB determines which loads are ready, and schedules one
• Load arbitrates for MEU
• DTLB Virtual → Physical + DCU set access
• MOB checks blocking/forwarding
• DCU way select / Store Data Buffer read
• Write back data / write block code
[Figure: load pipeline diagram (IDQ, Alloc, ROB, RS, Schedule, AGU, LB Write, DTLB, DCU, WB, MOB, LB, Retire)]

46 Computer Structure 2014 – P6 uArch 46 Pipeline: Load: Retire
• Reclaim ROB, LB entries
• Commit results to RRF
[Figure: load pipeline diagram (IDQ, Alloc, ROB, RS, Schedule, AGU, LB Write, DTLB, DCU, WB, MOB, LB, Retire)]

47 Computer Structure 2014 – P6 uArch 47 Pipeline: Store: Allocate
• Allocate ROB/RS
• Allocate Store Buffer entry
[Figure: store pipeline diagram (IDQ, Alloc, ROB, RS, Schedule, AGU, DTLB, SB, Retire)]

48 Computer Structure 2014 – P6 uArch 48 Pipeline: Store: STA EXE
• RS checks when data used for address calculation is ready
  – Dispatches STA to AGU
• AGU calculates linear address
• Write linear address to Store Buffer
• DTLB Virtual → Physical
• Load Buffer Memory Disambiguation verification
• Write physical address to Store Buffer
[Figure: store pipeline diagram (IDQ, Alloc, ROB, RS, Schedule, AGU, SB V.A., DTLB, SB P.A., SB, Retire)]

49 Computer Structure 2014 – P6 uArch 49 Pipeline: Store: STD EXE
• RS checks when data for STD is ready
  – Dispatches STD
• Write data to Store Buffer
[Figure: store pipeline diagram (IDQ, Alloc, ROB, RS, Schedule, SB data, SB, Retire)]

50 Computer Structure 2014 – P6 uArch 50 Pipeline: Senior Store Retirement
• When STA (and thus STD) retires
  – Store Buffer entry marked as senior
• When DCU idle ⇒ MOB dispatches senior store
• Read senior entry
  – Store Buffer sends data and physical address
• DCU writes data
• Reclaim SB entry
[Figure: store pipeline diagram (IDQ, Alloc, ROB, RS, Schedule, Retire, SB, MOB, DCU)]

