Dr. George Michelogiannakis EECS, University of California at Berkeley

Presentation transcript:

CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory http://inst.eecs.berkeley.edu/~cs152 CS252 S05

Last time in Lecture 11
- Register renaming removes WAR and WAW hazards
- In-order fetch/decode, out-of-order execute, and in-order commit give high performance and precise exceptions
- Need to rapidly recover on branch mispredictions
- Unified physical register file machines remove data values from the ROB
  - All values are only read and written during execution
  - Only register tags are held in the ROB

Question of the Day How many in-order cores do you think take up the same area as an out-of-order core?

Reminder
[Figure: pipeline overview with in-order PC, Fetch, and Decode & Rename, out-of-order Execute (Physical Reg. File, Branch Unit, ALU, MEM, Store Buffer, D$), and in-order Commit from the Reorder Buffer.]

Separate Pending Instruction Window from ROB
- The instruction window holds instructions that have been decoded and renamed but not issued into execution. Each entry has register tags and presence bits, and a pointer to its ROB entry.
- Instructions that have committed are no longer in the instruction window.
- The reorder buffer is used to hold exception information for commit.
- The ROB is usually several times larger than the instruction window. Why?
[Figure: instruction window entries with fields use, ex, op, p1, PR1, p2, PR2, PRd, ROB#; ROB entries with fields Done?, Rd, LPRd, PC, Except?; Ptr2 marks the next entry to commit and Ptr1 the next available entry.]

Reorder Buffer Holds Active Instructions (Decoded but not Committed)
[Figure: ROB contents at cycle t and at cycle t + 1, running from older instructions (Commit end) through Execute to newer instructions (Fetch end). Both snapshots list the same sequence:]
  ld  x1, (x3)
  add x3, x1, x2
  sub x6, x7, x9
  add x3, x3, x6
  ld  x6, (x1)
  add x6, x6, x3
  sd  x6, (x1)

Issue Timing
[Figure: two issue schedules for i1 (Add R1,R1,#1; Issue1, Execute1) and the dependent i2 (Sub R1,R1,#1; Issue2, Execute2).]
- How can we issue earlier? By using knowledge of execution latency (bypass).
- What makes this schedule fail? An execution latency that wasn’t as expected.

Issue Queue with latency prediction
[Figure: issue queue (reorder buffer) with entry fields Inst#, use, exec, op, p1, lat1, src1, p2, lat2, src2, dest; ptr2 marks the next entry to commit and ptr1 the next available entry; entries following a BEQZ are speculative instructions.]
- Fixed latency: latency included in queue entry (‘bypassed’)
- Predicted latency: latency included in queue entry (speculated)
- Variable latency: wait for completion signal (stall)

Improving Instruction Fetch
Performance of speculative out-of-order machines is often limited by instruction fetch bandwidth:
- speculative execution can fetch 2-3x more instructions than are committed
- mispredict penalties are dominated by the time to refill the instruction window
- taken branches are particularly troublesome

Increasing Taken Branch Bandwidth (Alpha 21264 I-Cache)
[Figure: I-cache holding cached instructions together with line-predict and way-predict fields; the fast fetch path runs through PC generation and delivers 4 insts per fetch, while tag compare (=?), branch prediction, instruction decode, and validity checks produce Hit/Miss/Way.]
- Fold 2-way tags and BTB into the predicted next block
- Take tag checks, instruction decode, and branch prediction out of the loop
- Raw RAM speed on the critical loop (1 cycle at ~1 GHz)
- 2-bit hysteresis counter per block prevents overtraining

Tournament Branch Predictor (Alpha 21264)
[Figure: the PC indexes a local history table (1,024 x 10b) that feeds the local prediction table (1,024 x 3b); a 12-bit global history indexes the global prediction (4,096 x 2b) and choice prediction (4,096 x 2b) tables; the choice output selects the final prediction.]
- The choice predictor learns whether it is best to use local or global branch history in predicting the next branch (the best in each case)
- Global history is speculatively updated but restored on mispredict
- Claimed 90-100% success on a range of applications
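
To make the tournament structure concrete, here is a minimal Python sketch that uses the table sizes quoted on the slide. The class name, the PC index hash, the initial counter values, and the rule of training the chooser only on disagreement are illustrative assumptions, not details confirmed by the slide. It also updates the global history at resolve time rather than speculatively, so the restore-on-mispredict step is not modeled.

```python
class TournamentPredictor:
    """Toy local/global tournament predictor using the slide's table sizes."""

    def __init__(self):
        self.local_hist = [0] * 1024   # 1,024 x 10-bit per-branch histories
        self.local_ctr = [3] * 1024    # 1,024 x 3-bit counters (0..7)
        self.global_ctr = [1] * 4096   # 4,096 x 2-bit counters (0..3)
        self.choice_ctr = [1] * 4096   # 4,096 x 2-bit counters; >= 2 picks global
        self.ghist = 0                 # 12-bit global history

    def _lookup(self, pc):
        lidx = (pc >> 2) & 1023        # assumed index hash, not from the slide
        lhist = self.local_hist[lidx]
        local_taken = self.local_ctr[lhist] >= 4
        global_taken = self.global_ctr[self.ghist] >= 2
        return lidx, lhist, local_taken, global_taken

    def predict(self, pc):
        _, _, local_taken, global_taken = self._lookup(pc)
        use_global = self.choice_ctr[self.ghist] >= 2
        return global_taken if use_global else local_taken

    def update(self, pc, taken):
        lidx, lhist, local_taken, global_taken = self._lookup(pc)
        g = self.ghist
        step = 1 if taken else -1
        # Train the chooser only when the two components disagree (assumption).
        if local_taken != global_taken:
            vote = 1 if global_taken == taken else -1
            self.choice_ctr[g] = min(3, max(0, self.choice_ctr[g] + vote))
        # Saturating updates of both component predictors.
        self.local_ctr[lhist] = min(7, max(0, self.local_ctr[lhist] + step))
        self.global_ctr[g] = min(3, max(0, self.global_ctr[g] + step))
        # Shift the outcome into both histories (10 and 12 bits wide).
        self.local_hist[lidx] = ((lhist << 1) | int(taken)) & 1023
        self.ghist = ((g << 1) | int(taken)) & 4095
```

Calling `predict(pc)` before `update(pc, taken)` for each dynamic branch mimics the predict-then-train order of a real front end.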

Taken Branch Limit
- Integer codes have a taken branch every 6-9 instructions
- To avoid a fetch bottleneck, must execute multiple taken branches per cycle when increasing performance
- This implies:
  - predicting multiple branches per cycle
  - fetching multiple non-contiguous blocks per cycle

Branch Address Cache (Yeh, Marr, Patt)
[Figure: the PC is compared (=) against the Entry PC of a valid entry to signal a match; the entry returns valid, predicted target #1 with len #1, and predicted target #2.]
- Extend the BTB to return multiple branch predictions per cycle

Fetching Multiple Basic Blocks
- Requires either:
  - a multiported cache: expensive
  - interleaving: bank conflicts will occur
- Merging multiple blocks to feed to the decoders adds latency, increasing the mispredict penalty and reducing branch throughput

Trace Cache
- Key idea: pack multiple non-contiguous basic blocks into one contiguous trace cache line
- A single fetch brings in multiple basic blocks
- The trace cache is indexed by start address and the next n branch predictions
- Used in the Intel Pentium 4 processor to hold decoded uops
[Figure: a trace cache line containing several basic blocks, each ending in a branch (BR).]

Superscalar Register Renaming
- During decode, instructions are allocated a new physical destination register
- Source operands are renamed to the physical register with the newest value
- The execution unit only sees physical register numbers
- Issue multiple instructions per cycle. Does this work?
[Figure: Inst 1 and Inst 2 (Op, Src1, Src2, Dest) drive the read addresses of the rename table; the register free list drives the write ports to update the mapping; the read data yields Op, PSrc1, PSrc2, PDest for each instruction.]

Superscalar Register Renaming
- Must check for RAW hazards between instructions issuing in the same cycle. This can be done in parallel with the rename lookup.
- The figure doesn’t show that the update reflects the last dest. (jse)
- MIPS R10K renames 4 serially-RAW-dependent insts/cycle
[Figure: Inst 1 and Inst 2 (Op, Src1, Src2, Dest) drive the read addresses of the rename table; the register free list drives the write ports to update the mapping; comparators (=?) check sources against destinations within the group before producing Op, PSrc1, PSrc2, PDest for each instruction.]
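
The intra-group RAW check can be sketched in Python as follows. The function name `rename_group`, the tuple layout, and the register names are illustrative assumptions. Every instruction is assumed to write a destination register, and the free list is assumed never to run dry. The sketch reads the rename table once for the whole group, as parallel hardware lookups would, then lets the comparators override any source that matches an older destination in the same group.

```python
from collections import deque


def rename_group(insts, rename_table, free_list):
    """Rename a group of (op, src1, src2, dest) tuples decoded in one cycle.

    Returns a list of (op, psrc1, psrc2, pdest) tuples.
    """
    # All table reads see the mapping from before this group (parallel lookup).
    looked_up = [(rename_table[s1], rename_table[s2]) for _, s1, s2, _ in insts]
    # Allocate one new physical destination per instruction, in program order.
    pdests = [free_list.popleft() for _ in insts]

    renamed = []
    for i, (op, s1, s2, _) in enumerate(insts):
        ps1, ps2 = looked_up[i]
        # Comparators (=?): a match against an older instruction in the group
        # overrides the table value; the youngest older match wins.
        for j in range(i):
            older_dest = insts[j][3]
            if older_dest == s1:
                ps1 = pdests[j]
            if older_dest == s2:
                ps2 = pdests[j]
        renamed.append((op, ps1, ps2, pdests[i]))

    # Update the mapping in program order so the last writer of a register wins.
    for (_, _, _, dest), pd in zip(insts, pdests):
        rename_table[dest] = pd
    return renamed
```

For example, renaming `add x3, x1, x2` and `sub x1, x3, x2` together makes the second instruction read the physical register just allocated to the first, not the stale table entry for `x3`.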

Speculative Loads / Stores
- Just like register updates, stores should not modify memory until after the instruction is committed
- A speculative store buffer is a structure introduced to hold speculative store data

Speculative Store Buffer
[Figure: speculative store buffer entries, each with Tag, Data, S, and V fields, written by Store Address and Store Data; the store commit path moves entries into the L1 data cache (Tags, Data).]
- During decode, a store buffer slot is allocated in program order
- Stores are split into “store address” and “store data” micro-operations
- “Store address” execution writes the tag
- “Store data” execution writes the data
- A store commits when it is the oldest instruction and both address and data are available: clear the speculative bit and eventually move the data to the cache
- On store abort: clear the valid bit

Load bypass from speculative store buffer
[Figure: the load address is looked up in both the speculative store buffer (Tag, Data, S, V entries) and the L1 data cache (Tags, Data) to produce the load data.]
- If data is in both the store buffer and the cache, which should we use? The speculative store buffer.
- If the same address is in the store buffer twice, which should we use? The youngest store older than the load.
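
The two forwarding rules can be illustrated with a small Python sketch. The class and method names, the `seq` program-order number, and the dictionary standing in for the L1 data cache are illustrative assumptions. For brevity, a store's address and data arrive together, and commit moves the data to the cache immediately rather than "eventually".

```python
class SpeculativeStoreBuffer:
    """Toy store buffer: entries kept in program order, oldest first."""

    def __init__(self):
        self.entries = []

    def store(self, seq, addr, data):
        # seq is a hypothetical program-order sequence number.
        self.entries.append(
            {"seq": seq, "tag": addr, "data": data, "S": True, "V": True}
        )

    def abort(self, seq):
        # On store abort: clear the valid bit.
        for e in self.entries:
            if e["seq"] == seq:
                e["V"] = False

    def commit_oldest(self, cache):
        # Clear the speculative bit and move the data to the cache.
        e = self.entries.pop(0)
        if e["V"]:
            e["S"] = False
            cache[e["tag"]] = e["data"]

    def load(self, load_seq, addr, cache):
        # Search youngest to oldest: the first valid match that is older than
        # the load is the youngest store older than the load.
        for e in reversed(self.entries):
            if e["V"] and e["seq"] < load_seq and e["tag"] == addr:
                return e["data"]
        # No match in the buffer: fall back to the cache.
        return cache.get(addr)
```

A load between two stores to the same address sees the first store's data, while a load after both sees the second store's data.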

Memory Dependencies
  sd x1, (x2)
  ld x3, (x4)
- When can we execute the load?

In-Order Memory Queue
- Execute all loads and stores in program order
  => A load or store cannot leave the ROB for execution until all previous loads and stores have completed execution
- Can still execute loads and stores speculatively, and out-of-order with respect to other instructions
- Need a structure to handle memory ordering…

Conservative O-o-O Load Execution
  sd x1, (x2)
  ld x3, (x4)
- Can execute the load before the store if the addresses are known and x4 != x2
- Each load address is compared with the addresses of all previous uncommitted stores
  - can use a partial conservative check, i.e., the bottom 12 bits of the address, to save hardware
- Don’t execute the load if any previous store address is not known (MIPS R10K, 16-entry address queue)
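
A minimal Python sketch of this conservative check follows. The function name and the use of `None` for a store address that is not yet known are illustrative assumptions. Comparing only the low 12 bits can report a false conflict, which costs performance but never correctness.

```python
def load_may_issue(load_addr, older_store_addrs, bits=12):
    """Return True if the load can safely execute before all older stores.

    older_store_addrs lists the addresses of previous uncommitted stores;
    None marks a store whose address has not been computed yet.
    """
    mask = (1 << bits) - 1
    for store_addr in older_store_addrs:
        if store_addr is None:
            return False  # unknown store address: be conservative and wait
        if (store_addr & mask) == (load_addr & mask):
            return False  # possible alias (may be a false positive)
    return True
```

For instance, a load from `0x2000` is held back by an older store to `0x5000` because both share the same bottom 12 bits, even though the full addresses differ.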

Address Speculation
  sd x1, (x2)
  ld x3, (x4)
- Guess that x4 != x2
- Execute the load before the store address is known
- Need to hold all completed but uncommitted load/store addresses in program order
- If we subsequently find x4 == x2, squash the load and all following instructions
  => Large penalty for inaccurate address speculation

Memory Dependence Prediction (Alpha 21264)
  sd x1, (x2)
  ld x3, (x4)
- Guess that x4 != x2 and execute the load before the store
- If we later find x4 == x2, squash the load and all following instructions, but mark the load instruction as store-wait
- Subsequent executions of the same load instruction will wait for all previous stores to complete
- Periodically clear the store-wait bits
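
The store-wait mechanism can be sketched in Python as a table of one-bit entries. The class name, the table size of 1,024 entries, and the PC index hash are illustrative assumptions, not details from the slide.

```python
class StoreWaitPredictor:
    """Toy store-wait table: one bit per (hashed) load PC."""

    def __init__(self, size=1024):
        self.size = size               # assumed table size
        self.wait = [False] * size

    def _index(self, load_pc):
        return (load_pc >> 2) % self.size  # assumed hash

    def must_wait(self, load_pc):
        # True: this load should wait for all previous stores to complete.
        return self.wait[self._index(load_pc)]

    def on_violation(self, load_pc):
        # The load executed early and later matched an older store's address.
        self.wait[self._index(load_pc)] = True

    def periodic_clear(self):
        # Clearing lets loads whose dependence has gone away speculate again.
        self.wait = [False] * self.size
```

A load starts out speculating freely, is marked after its first ordering violation, and gets another chance after the next periodic clear.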

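Datapath: Branch Prediction and Speculative Execution
[Figure: pipeline datapath with PC, Fetch, Decode & Rename, Reorder Buffer, and Commit, plus Reg. File, Branch Unit, ALU, MEM, Store Buffer, and D$ in Execute; branch prediction steers the PC, and branch resolution sends kill signals back into the pipeline.]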

Instruction Flow in Unified Physical Register File Pipeline
- Fetch
  - Get instruction bits from the current guess at the PC, place them in the fetch buffer
  - Update the PC using the sequential address or the branch predictor (BTB)
- Decode/Rename
  - Take an instruction from the fetch buffer
  - Allocate resources to execute the instruction:
    - Destination physical register, if the instruction writes a register
    - Entry in reorder buffer to provide in-order commit
    - Entry in issue window to wait for execution
    - Entry in memory buffer, if load or store
  - Decode will stall if resources are not available
  - Rename source and destination registers
  - Check source registers for readiness
  - Insert the instruction into issue window + reorder buffer + memory buffer

Memory Instructions
- Split each store instruction into two pieces during decode:
  - Address calculation, store-address
  - Data movement, store-data
- Allocate space in program order in the memory buffers during decode
- Store instructions:
  - Store-address calculates the address and places it in the store buffer
  - Store-data copies the store value into the store buffer
  - Store-address and store-data execute independently out of the issue window
  - Stores only commit to the data cache at the commit point
- Load instructions:
  - Load address calculation executes from the window
  - A load with a completed effective address searches the memory buffer
  - A load instruction may have to wait in the memory buffer for earlier store ops to resolve

Issue Stage
- Writebacks from the completion phase “wakeup” some instructions by causing their source operands to become ready in the issue window
  - In more speculative machines, might wake up waiting loads in the memory buffer
- Need to “select” some instructions for issue
  - Arbiter picks a subset of ready instructions for execution
  - Example policies: random, lower-first, oldest-first, critical-first
- Instructions are read out from the issue window and sent to execution
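
Wakeup and select can be sketched in Python as below, using the oldest-first policy from the list of example policies. The class names, the `age` field, and the dictionary of per-source ready bits are illustrative assumptions. Real hardware broadcasts tags and arbitrates in parallel rather than looping.

```python
from dataclasses import dataclass


@dataclass
class WindowEntry:
    age: int     # smaller means older in program order
    op: str
    srcs: dict   # physical register tag -> ready bit


class IssueWindow:
    def __init__(self):
        self.entries = []

    def insert(self, age, op, srcs):
        self.entries.append(WindowEntry(age, op, dict(srcs)))

    def wakeup(self, tag):
        # A writeback broadcasts its destination tag; matching sources become ready.
        for e in self.entries:
            if tag in e.srcs:
                e.srcs[tag] = True

    def select(self, width):
        # Oldest-first arbiter: pick up to `width` entries whose sources are all ready.
        ready = sorted(
            (e for e in self.entries if all(e.srcs.values())),
            key=lambda e: e.age,
        )
        chosen = ready[:width]
        for e in chosen:
            self.entries.remove(e)  # free the window slot on issue
        return [e.op for e in chosen]
```

An older instruction with a pending source is skipped by `select` until a `wakeup` on the missing tag arrives, after which it takes priority over younger ready instructions.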

Execute Stage
- Read operands from the physical register file and/or the bypass network from other functional units
- Execute on a functional unit
- Write the result value to the physical register file (or the store buffer if a store)
- Produce exception status, write it to the reorder buffer
- Free the slot in the instruction window

Commit Stage
- Read completed instructions in-order from the reorder buffer (may need to wait for the next oldest instruction to complete)
- If an exception is raised: flush the pipeline, jump to the exception handler
- Otherwise, release resources:
  - Free the physical register used by the last writer to the same architectural register
  - Free the reorder buffer slot
  - Free the memory reorder buffer slot
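
The commit loop above can be sketched in Python as follows. The function name, the dictionary fields (`done`, `exception`, `prev_preg`), and the commit width are illustrative assumptions. The flush is reduced to emptying the ROB. A real machine would also restore the rename state and redirect fetch to the handler.

```python
from collections import deque


def commit(rob, free_list, width=2):
    """Commit up to `width` instructions from the head of the ROB.

    rob is a deque of dicts, oldest first, with keys:
      done      - instruction has finished execution
      exception - instruction raised an exception
      prev_preg - physical register previously mapped to the destination
                  architectural register (None if there is no destination)
    Returns (number_committed, exception_raised).
    """
    committed = 0
    while rob and committed < width:
        head = rob[0]
        if not head["done"]:
            break  # must wait for the oldest instruction to complete
        if head["exception"]:
            rob.clear()  # flush all in-flight instructions
            return committed, True
        rob.popleft()  # free the reorder buffer slot
        if head["prev_preg"] is not None:
            # The last writer's register can no longer be read by anyone.
            free_list.append(head["prev_preg"])
        committed += 1
    return committed, False
```

Freeing the previous mapping rather than the instruction's own destination is the key point: the new physical register now holds the architectural value.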

Question of the Day How many in-order cores do you think take up the same area as an out-of-order core?

How Many In-Order In The Same Area?

Performance?

Acknowledgements
These slides contain material developed and copyright by:
- Arvind (MIT)
- Krste Asanovic (MIT/UCB)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)
MIT material derived from course 6.823
UCB material derived from course CS252