
1 SUPERSCALAR DESIGN PRIME Zhao Zhang CprE 381, Computer Organization and Assembly-Level Programming, Fall 2012 Original slides from CprE 581, Advanced Computer Architecture

2 History
Superscalar design first appeared in the 1960s: scoreboarding, the Tomasulo algorithm
In popular use since the 1990s: SGI MIPS processors, Sun UltraSPARC, DEC Alpha 21x64 series, Intel/AMD processors
Now appearing in embedded processors:
Cortex-A9: two-way, limited out-of-order
Cortex-A15: three-way, close to Intel/AMD designs

3 Why Superscalar
Get more performance than a scalar pipeline
Superscalar techniques:
Deep pipeline
Multi-issue
Branch prediction
Register renaming
Out-of-order execution
Speculative execution
Memory disambiguation

4 Code Example
for (i = 0; i < 1000; i++) X[i] = X[i] + b;

; loop body, initialization not shown
; R4: &X[i], R5: &X[1000] (= X + 4*1000), R6: b
Loop: LW   R8, 0(R4)    ; load X[i], R4 holds &X[i]
      ADD  R8, R8, R6   ; X[i] = X[i] + b
      SW   R8, 0(R4)    ; store X[i]
      ADDI R4, R4, 4    ; next element
      SLT  R9, R4, R5   ; R9 = (R4 < R5)
      BNE  R9, R0, Loop ; end of loop?

5 Frontend and Backend
Frontend: in-order fetch, decode, and rename
Backend: out-of-order issue, execute/writeback, in-order commit
The frontend may send "junk" instructions to the backend
Junk instructions arise from branch mis-predictions or exceptions
Design goal: minimize the percentage of "junk" instructions
The backend must be able to detect and handle "junk" instructions
Flush junk instructions upon detection
Commit (retire) in order, so that junk instructions never affect the "architectural state"
Handling a branch mis-prediction likely takes dozens of cycles

6 Frontend and Backend
[Figure: frontend and backend pipeline stages, from "Cortex-A9 Processor Microarchitecture", slide 6]

7 The Multi-Issue Factor
Multi-issue affects all pipeline stages. In the same cycle:
N inst. are fetched: usually from one I-cache block
N inst. are decoded: multiple decoders
N inst. are renamed: multi-ported renaming table, detecting intra-group dependences
In the backend:
Up to N inst. are scheduled: multi-ported queue with broadcast
N inst. read the register file: multi-ported register file
M inst. are executed at functional units: multiple functional units
N inst. write back register values: multi-ported register file
N inst. are committed: multi-banked reorder buffer, also involves the rename table
Note: "N" is not necessarily the same value across pipeline stages

8 Frontend: Branch Prediction
Branch prediction is critical to reducing "junk" instructions
Even a seemingly small mis-prediction rate is nearly disastrous (a short calculation follows below):
SPECint programs average ~15% branches, so every 100 instructions contain ~15 branches
Assume a 10% mis-prediction rate => 1.5 mis-predictions per 100 instructions
Assume a 20-cycle mis-prediction penalty => 30 lost cycles
Assume IPC = 3.0 => ~33.3 cycles to execute 100 instructions
So the 10% mis-prediction rate adds roughly 90% overhead
The mis-prediction penalty is workload-dependent, and can be significantly longer than 20 cycles
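The arithmetic above can be checked with a short calculation; this is a minimal sketch in Python, and the branch fraction, mis-prediction rate, penalty, and IPC are the assumed values from this slide rather than measured numbers.

# Rough cost of branch mis-prediction, using the assumed numbers above.
insts = 100             # window of instructions
branch_frac = 0.15      # ~15% branches (assumed SPECint average)
mispredict_rate = 0.10  # assumed mis-prediction rate
penalty_cycles = 20     # assumed mis-prediction penalty
ipc = 3.0               # assumed peak IPC

base_cycles = insts / ipc                                              # ~33.3 cycles
lost_cycles = insts * branch_frac * mispredict_rate * penalty_cycles   # 30 cycles
print(f"base: {base_cycles:.1f} cycles, lost: {lost_cycles:.1f} cycles, "
      f"overhead: {lost_cycles / base_cycles:.0%}")                    # overhead: 90%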

9 Frontend: Branch Prediction
A branch prediction is made every cycle; otherwise, the instruction flow stops
Prediction is done in parallel with instruction fetch
The backend sends back feedback about past predictions
[Figure: single-cycle prediction loop — the predicted PC (Pred-PC) indexes the instruction cache and the target, branch, and return-address predictors; feedback comes from the backend]

10 Frontend: Branch Prediction
Three components in a simple design:
Branch Target Buffer (BTB): what is the branch target?
Branch History Table (BHT): is the branch taken or not?
Return Address Stack (RAS): a function return is a special type of branch instruction, and a return may have multiple valid targets (a small RAS sketch follows below)
How the BTB and BHT work in general:
Bet that the same patterns will repeat
Use only the PC and past branch outcome history in the prediction
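A minimal sketch of the RAS idea mentioned above: calls push the address of the instruction after the call, and returns pop it as the predicted target. The stack depth and overflow behavior here are illustrative assumptions, not a specific processor's design.

class ReturnAddressStack:
    """Tiny RAS model: push on call, pop on return (oldest entry lost on overflow)."""
    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc, inst_bytes=4):
        if len(self.stack) == self.depth:
            self.stack.pop(0)                        # overwrite the oldest entry when full
        self.stack.append(call_pc + inst_bytes)      # predicted return address

    def predict_return(self):
        return self.stack.pop() if self.stack else None   # None -> no prediction

ras = ReturnAddressStack()
ras.on_call(0x400100)               # call instruction at (assumed) PC 0x400100
print(hex(ras.predict_return()))    # 0x400104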

11 Frontend: Branch Prediction
Branch Target Buffer with a combined Branch History Table:
The PC of the instruction being fetched is compared against the branch-PC tags stored in the BTB
Each entry holds a predicted PC and extra prediction state bits (see later)
Match: the instruction is a branch; use the predicted PC as the next PC
No match: no branch predicted; proceed normally (next PC = PC + 4)
(From slides of CprE 581, Computer Systems Architecture; a lookup sketch follows below)
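The BTB lookup just described can be modeled as a small tagged table indexed by the fetch PC. This is a sketch only; the direct-mapped organization, entry count, and update policy are assumptions for illustration.

class BranchTargetBuffer:
    """Direct-mapped BTB: each entry holds a branch-PC tag, a predicted target,
    and a prediction state bit (see the state-bit slides)."""
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = {}          # index -> (branch_pc_tag, predicted_pc, state_bit)

    def predict(self, fetch_pc):
        idx = (fetch_pc >> 2) % self.entries
        entry = self.table.get(idx)
        if entry and entry[0] == fetch_pc and entry[2] == 1:   # hit, and state bit says Taken
            return entry[1]                                    # use the predicted target
        return fetch_pc + 4                                    # otherwise: next sequential PC

    def update(self, branch_pc, target_pc, taken):
        idx = (branch_pc >> 2) % self.entries
        self.table[idx] = (branch_pc, target_pc, 1 if taken else 0)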

12 Frontend: Branch Prediction
First time fetching at BNE: predicted as Not Taken
Loop: LW   R8, 0(R4)    ; load X[i], R4 holds &X[i]
      ADD  R8, R8, R6   ; X[i] = X[i] + b
      SW   R8, 0(R4)    ; store X[i]
      ADDI R4, R4, 4    ; next element
      SLT  R9, R4, R5   ; end of array?
=>    BNE  R9, R0, Loop
The BTB starts empty (no valid entries), so every instruction is predicted Not Taken
That is right for LW through SLT, but WRONG for BNE => mis-prediction on the 1st fetch

13 Frontend: Branch Prediction
What happens after the mis-prediction:
1. The frontend starts fetching junk instructions, probably dozens of them
2. The backend detects the mis-prediction, flushes the backend pipeline, and notifies the frontend about the mis-predicted branch
3. The frontend updates the BTB/BHT, filling in BNE-PC and LW-PC and changing the prediction state bit
4. The frontend restarts fetching from LW-PC
The BTB now holds one valid entry: branch PC = BNE-PC, predicted PC = LW-PC, state bit = 1

14 Frontend: Branch Prediction
2nd time fetching at BNE: predicted as Taken, jump to LW-PC
Loop: LW   R8, 0(R4)    ; load X[i], R4 holds &X[i]
      ADD  R8, R8, R6   ; X[i] = X[i] + b
      SW   R8, 0(R4)    ; store X[i]
      ADDI R4, R4, 4    ; next element
      SLT  R9, R4, R5   ; end of array?
=>    BNE  R9, R0, Loop
The BTB entry (BNE-PC -> LW-PC, state bit 1) matches, so BNE is predicted Taken: RIGHT
The other instructions still miss in the BTB and are predicted Not Taken: also right

15 Frontend: Branch Prediction
Last time fetching at BNE-PC: predicted as Taken
It is wrong, because the loop exits this time
The prediction state bit of the entry (BNE-PC -> LW-PC) is changed to 0
Next time, the prediction for BNE-PC will be Not Taken

16 Branch Prediction State
[Figure: general form of a 1-bit predictor — (1) the PC accesses a state bit, (2) the bit predicts the outcome (T/NT), (3) the actual outcome is fed back to update the bit; state 1 predicts Taken, state 0 predicts Not Taken]
(From CprE 581, Computer Systems Architecture)

17 Branch History Table
Branch direction prediction is usually the more challenging part
The BHT can be separated from the BTB (often the case)
2-bit or 3-bit prediction states are usually used (see the sketch below)
The BHT can be organized in two levels to exploit correlation between branches
The BHT can have more sophisticated organizations to further improve accuracy
Return Address Stack: works on return instructions, simple and effective (not discussed further)
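A minimal sketch of a BHT built from 2-bit saturating counters, as mentioned above; the table size, PC indexing, and initial counter value are illustrative assumptions.

class TwoBitBHT:
    """Branch history table of 2-bit saturating counters.
    Counter values 0,1 -> predict Not Taken; 2,3 -> predict Taken."""
    def __init__(self, entries=4096):
        self.entries = entries
        self.counters = [1] * entries          # start weakly Not Taken (assumed)

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2    # True = Taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)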

18 Frontend: Register Renaming
Consider two loop iterations: they conflict on register usage and cannot be executed in parallel as written, yet they are mostly parallel
; first iteration
LW   R8, 0(R4)    ; load X[i], R4 holds &X[i]
ADD  R8, R8, R6   ; X[i] = X[i] + b
SW   R8, 0(R4)    ; store X[i]
ADDI R4, R4, 4    ; next element
SLT  R9, R4, R5   ; end of array?
BNE  R9, R0, Loop
; second iteration
LW   R8, 0(R4)
ADD  R8, R8, R6
SW   R8, 0(R4)
ADDI R4, R4, 4
SLT  R9, R4, R5
BNE  R9, R0, Loop

19 Frontend: Register Renaming
Rename architectural registers to physical registers: remove false dependences, keep true dependences
; first iteration
LW   P32, 0(P4)   ; load X[i]
ADD  P33, P32, P6 ; X[i] = X[i] + b
SW   P33, 0(P4)   ; store X[i]
ADDI P34, P4, 4   ; next element
SLT  P35, P34, P5 ; end of array?
BNE  P35, P0, Loop
; second iteration
LW   P36, 0(P34)
ADD  P37, P36, P6
SW   P37, 0(P34)
ADDI P38, P34, 4
SLT  P39, P38, P5
BNE  P39, P0, Loop

20 Frontend: Register Renaming
How the design works:
A register mapping table maps each architectural register to a physical register
A queue (free list) holds the free physical registers
Every instruction with an output register is assigned an unused, free physical register
Another mapping table is used to recover from a mis-predicted path
There are a number of design variants in real processors (a minimal sketch follows below)
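A minimal sketch of the mapping-table-plus-free-list scheme described above; the register counts and the simple checkpoint/restore recovery are illustrative assumptions rather than any particular processor's implementation.

class Renamer:
    """Rename architectural source/destination registers to physical registers."""
    def __init__(self, num_arch=32, num_phys=64):
        self.map_table = {f"R{i}": f"P{i}" for i in range(num_arch)}
        self.free_list = [f"P{i}" for i in range(num_arch, num_phys)]

    def rename(self, dst, srcs):
        phys_srcs = [self.map_table[s] for s in srcs]   # read current mappings (true deps kept)
        phys_dst = None
        if dst is not None:
            phys_dst = self.free_list.pop(0)            # fresh register removes false deps
            self.map_table[dst] = phys_dst
        return phys_dst, phys_srcs

    def checkpoint(self):
        return dict(self.map_table)                     # saved for mis-prediction recovery

    def restore(self, saved):
        self.map_table = saved

r = Renamer()
print(r.rename("R8", ["R4"]))        # LW  R8, 0(R4)  -> ('P32', ['P4'])
print(r.rename("R8", ["R8", "R6"]))  # ADD R8, R8, R6 -> ('P33', ['P32', 'P6'])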

21 Frontend: Register Renaming
The roles of register renaming:
Remove register name dependences and keep true data dependences, so that more instructions can be safely reordered
Help the backend implement speculative execution, because junk instructions cannot affect the inputs of good instructions:
A younger instruction writes to a newly assigned physical register, so it cannot affect the inputs of older instructions
A good instruction is always older than any junk instruction

22 Backend: Out-Of-Order Scheduling
Common design: Issue Queue
Each entry holds the op, a busy bit, the destination tag, and the source tags with ready bits; each instruction also has a ROB entry and, for loads/stores, an LSQ entry

Op    dst  src1 (ready?)  src2 (ready?)   ROB  LSQ
LW    P32  P4   yes        0x0  yes        1    1
ADD   P33  P32  no         P6   yes        2    -
SW    --   P33  no         P4   yes        3    2
ADDI  P34  P4   yes        0x4  yes        4    -
SLT   P35  P34  no         P5   yes        5    -
BNE   --   P35  no         P0   yes        6    -

23 Backend: Out-Of-Order Scheduling
Schedule: select ready instructions (here LW and ADDI, whose sources are all ready), and broadcast their tags (dst) to all other entries for matching

Op    dst  src1 (ready?)  src2 (ready?)   ROB  LSQ
LW    P32  P4   yes        0x0  yes        1    1   <= selected
ADD   P33  P32  no         P6   yes        2    -
SW    --   P33  no         P4   yes        3    2
ADDI  P34  P4   yes        0x4  yes        4    -   <= selected
SLT   P35  P34  no         P5   yes        5    -
BNE   --   P35  no         P0   yes        6    -

24 Backend: Out-Of-Order Scheduling
After LW and ADDI are issued (assume no new instructions): their entries are freed, and the broadcast of P32 and P34 has marked the matching sources of ADD and SLT as ready

Op    dst  src1 (ready?)  src2 (ready?)   ROB  LSQ
ADD   P33  P32  yes        P6   yes        2    -
SW    --   P33  no         P4   yes        3    2
SLT   P35  P34  yes        P5   yes        5    -
BNE   --   P35  no         P0   yes        6    -

25 Backend: Out-Of-Order Scheduling
After ADD and SLT are issued (assume no new instructions): the broadcast of P33 and P35 has made SW and BNE ready

Op    dst  src1 (ready?)  src2 (ready?)   ROB  LSQ
SW    --   P33  yes        P4   yes        3    2
BNE   --   P35  yes        P0   yes        6    -

26 Backend: Out-Of-Order Scheduling
How the design works:
Instructions are sent to the issue queue after renaming
A select logic chooses up to N instructions, all dependence-free, to be executed
The tags of the selected instructions are broadcast to all other queue entries
A wakeup logic clears the dependences of other instructions on the selected instructions (see the sketch below)
Two major design variants: Issue Queue vs. Reservation Station
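A minimal sketch of one select/broadcast/wakeup cycle as described above; the entry layout, the oldest-first selection policy, and the issue width are assumptions for illustration.

def schedule_cycle(issue_queue, issue_width=2):
    """One scheduling cycle: select up to issue_width ready entries,
    broadcast their destination tags, and wake up dependent entries."""
    ready = [e for e in issue_queue if all(e["ready"].values())]
    selected = ready[:issue_width]                 # oldest-first selection (assumed policy)
    for inst in selected:
        issue_queue.remove(inst)
        tag = inst["dst"]
        if tag is None:
            continue
        for other in issue_queue:                  # tag broadcast + wakeup
            if tag in other["ready"]:
                other["ready"][tag] = True
    return selected

# Issue queue entries for the renamed loop body (sources keyed by producing tag).
iq = [
    {"op": "LW",   "dst": "P32", "ready": {"P4": True}},
    {"op": "ADD",  "dst": "P33", "ready": {"P32": False, "P6": True}},
    {"op": "SW",   "dst": None,  "ready": {"P33": False, "P4": True}},
    {"op": "ADDI", "dst": "P34", "ready": {"P4": True}},
    {"op": "SLT",  "dst": "P35", "ready": {"P34": False, "P5": True}},
    {"op": "BNE",  "dst": None,  "ready": {"P35": False, "P0": True}},
]
print([i["op"] for i in schedule_cycle(iq)])   # ['LW', 'ADDI']
print([i["op"] for i in schedule_cycle(iq)])   # ['ADD', 'SLT']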

27 Backend: Register Read, Data Forwarding and Writeback
Note: in a reservation-station design, register read happens before instruction scheduling
[Figure: backend pipeline — Issue (scheduling) from the issue queue, Reg-Read from the register file, Execute at the functional units (load/store units, integer units, multiply/divide units, others) with a forwarding network, then Writeback to the register file]

28 Reorder Buffer and In-Order Commit
[Figure: the ROB as a circular buffer with head and tail pointers — entries are allocated at the tail as instructions are renamed and freed at the head as instructions commit]

29 Reorder Buffer and In-Order Commit
Instructions enter and leave the ROB in program order
The "architectural register state" changes in program order
Junk instructions may produce values, but their values never appear in the "architectural register state"
Junk instructions will be flushed upon detection
Typical ROB entry fields: destination architectural register, destination physical register, exception flags, program counter, branch/load-store flags, ready bit (a commit sketch follows below)
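A minimal sketch of in-order commit from the ROB head; the entry fields mirror the list above, and the flush/recovery handling is deliberately simplified, so treat it as an illustration rather than a full design.

from collections import deque

class ReorderBuffer:
    """Entries are allocated at the tail in program order and committed from the head."""
    def __init__(self):
        self.entries = deque()

    def allocate(self, pc, dest_arch, dest_phys):
        self.entries.append({"pc": pc, "dest_arch": dest_arch, "dest_phys": dest_phys,
                             "ready": False, "exception": False, "mispredicted": False})

    def commit(self, arch_map_table):
        """Retire completed instructions from the head; stop at the first incomplete one.
        Only committed instructions update the architectural register mapping."""
        while self.entries and self.entries[0]["ready"]:
            e = self.entries.popleft()
            if e["exception"] or e["mispredicted"]:
                self.entries.clear()              # flush all younger (possibly junk) entries
                return e                          # hand off to exception/recovery handling
            if e["dest_arch"] is not None:
                arch_map_table[e["dest_arch"]] = e["dest_phys"]
        return None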

30 Recall the Renaming Example
Consider two loop iterations, with architectural registers renamed to physical registers (false dependences removed, true dependences kept):
; first iteration
LW   P32, 0(P4)   ; load X[i]
ADD  P33, P32, P6 ; X[i] = X[i] + b
SW   P33, 0(P4)   ; store X[i]
ADDI P34, P4, 4   ; next element
SLT  P35, P34, P5 ; end of array?
BNE  P35, P0, Loop
; second iteration
LW   P36, 0(P34)
ADD  P37, P36, P6
SW   P37, 0(P34)
ADDI P38, P34, 4
SLT  P39, P38, P5
BNE  P39, P0, Loop

31 Architectural Register State
The loop code is as on the previous slide; the first iteration is on the correct path, and the second iteration is fetched down the mis-predicted path
Three snapshots of the register mappings:

Before the loop is renamed:
architectural mapping: R0->P0, R4->P4, R5->P5, R6->P6, R8->P8, R9->P9
speculative mapping:   R0->P0, R4->P4, R5->P5, R6->P6, R8->P8, R9->P9

After the first iteration is renamed (not yet committed):
architectural mapping: R0->P0, R4->P4,  R5->P5, R6->P6, R8->P8,  R9->P9
speculative mapping:   R0->P0, R4->P34, R5->P5, R6->P6, R8->P33, R9->P35

After the first iteration commits and the second (mis-predicted) iteration is renamed:
architectural mapping: R0->P0, R4->P34, R5->P5, R6->P6, R8->P33, R9->P35
speculative mapping:   R0->P0, R4->P38, R5->P5, R6->P6, R8->P37, R9->P39

32 Summary
What we have learned:
In-order frontend vs. out-of-order backend
Branch prediction to keep the instruction flow going
Register renaming to remove name dependences and support speculative execution
Out-of-order scheduling with an issue queue
In-order commit with a reorder buffer
What we haven't learned yet:
Memory disambiguation using the load queue and store queue
Details of complex real processors

