ECE/CS 552: Introduction to Superscalar Processors


ECE/CS 552: Introduction to Superscalar Processors Instructor: Mikko H Lipasti TA: Daniel Chang Section 1 Fall 2005 University of Wisconsin-Madison Lecture notes partially based on notes by John P. Shen

Limitations of Scalar Pipelines
- Scalar upper bound on throughput: IPC <= 1 (CPI >= 1)
- Inefficient unified pipeline: long latency for each instruction
- Rigid pipeline stall policy: one stalled instruction stalls all newer instructions
© Shen, Lipasti

Parallel Pipelines Temporal vs. Spatial vs. Both © Shen, Lipasti

Intel Pentium Parallel Pipeline
- 486 pipeline on the left: 2 decode stages due to the complex ISA
- Pentium parallel pipeline: U pipe is universal (can handle any op); V pipe can't handle the most complex ops
- Stages: fetch and align; decode & generate control word; decode control word & generate memory address; ALU or D$
- Used branch prediction
- Approximates a dual-ported D$ with 8-way interleaved banks; must handle occasional bank conflicts
© Shen, Lipasti

Diversified Pipelines Unified pipelines make sense in a scalar organization, but with multiple issue they are inefficient and unnecessary: specialized pipelines make much more sense. Note that all instructions are treated identically in IF, ID (also RD, more or less), and WB. Why? Because they behave very much the same way in those stages. What mix of units? 1st generation: integer & FPU; later: more. Must determine for your workload. © Shen, Lipasti

Power4 Diversified Pipelines
[Diagram: PC, I-Cache, BR scan/predict, fetch queue, decode, reorder buffer; issue queues feed a BR/CR unit, two FX/LD clusters (FX1/LD1, FX2/LD2), two FP pipes (FP1/FP2), store queue, D-Cache]
IBM Power4, introduced in 2001. Leadership performance at 1.3 GHz. Most recently used in the Apple PowerMac G5, running at 2 GHz. The G5 also has an Altivec unit for media processing instructions, not shown in this diagram. All units shown are fully pipelined. Note the two clusters of integer units, a separate FP cluster, plus the BR and CR units. © Shen, Lipasti

Rigid Pipeline Stall Policy Bypassing of a stalled instruction is not allowed: stalls propagate backward to all newer instructions. Think McDonald's: even though looking over the menu is pipelined with ordering, food prep, and delivery, the two are coupled: if the person in front of you has a special order (no ketchup or pickles) and the employee can't find it under the heat lamp, all customers behind this customer are stalled. Stupid. Wendy's model is much better, where ordering is in-order, but food prep and order delivery occur out of order. Although I noted recently that Wendy's is giving up on this as well. © Shen, Lipasti

Dynamic Pipelines Like Wendy's: once ID/RD has determined what you need, you get queued up, and others behind you can get past you. In-order front end, OOO or dynamic execution in a micro-dataflow machine, in-order back end. Interlock hardware (later) maintains dependences. The reorder buffer tracks completion and exceptions, and provides precise interrupts: drain pipeline, restart. In-order machine state follows the sequential execution model inherited from nonpipelined/pipelined machines. © Shen, Lipasti

Interstage Buffers Key differentiator for OOO pipelines. Scalar pipe: just pipeline latches or flip-flops. In-order superscalar pipe: just wider ones. Out-of-order: buffers start to look more like register files (with random access necessary) or shift registers. May require an effective crossbar between slots before/after the buffer. May need to be a multiported CAM. © Shen, Lipasti

Superscalar Pipeline Stages [Diagram: front-end stages in program order, execution out of order, back-end stages in program order] © Shen, Lipasti

Limitations of Scalar Pipelines
- Scalar upper bound on throughput (IPC <= 1 or CPI >= 1). Solution: wide (superscalar) pipeline
- Inefficient unified pipeline, long latency for each instruction. Solution: diversified, specialized pipelines
- Rigid pipeline stall policy, one stalled instruction stalls all newer instructions. Solution: out-of-order execution, distributed execution pipelines
© Shen, Lipasti

Impediments to High IPC Any of the three flows can become performance bottleneck. Must address all in a balanced fashion. © Shen, Lipasti

Superscalar Pipeline Design Instruction Fetching Issues Instruction Decoding Issues Instruction Dispatching Issues Instruction Execution Issues Instruction Completion & Retiring Issues © Shen, Lipasti

Instruction Fetch Objective: fetch multiple instructions per cycle. Don't starve the pipeline: must fetch n instructions per cycle from the I$. Challenges: branches (control dependences), branch target misalignment, instruction cache misses. Solutions: code alignment (static vs. dynamic), prediction/speculation. © Shen, Lipasti

Fetch Alignment Fetch size n=4: losing fetch bandwidth if not aligned. Static/compiler approach: align branch targets at 00 (may need to pad with NOPs); implementation specific. © Shen, Lipasti

Branches – MIPS 6 Types of Branches Jump (uncond, no save PC, imm) Jump and link (uncond, save PC, imm) Jump register (uncond, no save PC, register) Jump and link register (uncond, save PC, register) Branch (conditional, no save PC, PC+imm) Branch and link (conditional, save PC, PC+imm) © Shen, Lipasti

Disruption of Sequential Control Flow Penalty (increased lost opportunity cost): 1) target address generation, 2) condition resolution. Sketch pipeline diagram; the Tyranny of Amdahl's Law. © Shen, Lipasti

Branch Prediction Target address generation → target speculation. Access register: PC, general purpose register, link register. Perform calculation: +/- offset, autoincrement. Condition resolution → condition speculation. Condition code register, general purpose register, comparison of data register(s). A non-CC architecture requires actual computation to resolve the condition. © Shen, Lipasti

Target Address Generation Decode: PC = PC + DISP (adder), 1 cycle penalty. Dispatch: PC = (R2), 2 cycle penalty. Execute: PC = (R2) + offset, 3 cycle penalty. Applies to both conditional & unconditional branches. Guess the target address? Keep history based on the branch address. © Shen, Lipasti

Condition Resolution Dispatch: RD, 2 cycle penalty. Execute: status, 3 cycle penalty. Applies to both conditional & unconditional branches. Guess taken vs. not taken. © Shen, Lipasti

Branch Instruction Speculation Fetch: PC = PC + 4 BTB (cache) contains k cycles of history, generates guess Execute: corrects wrong guess © Shen, Lipasti

Static Branch Prediction Single-direction: always not-taken (Intel i486). Backwards taken / forwards not-taken: matches loop-closing branches; used as a backup in Pentium Pro, II, III, 4. © Shen, Lipasti
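
The backwards-taken/forwards-not-taken heuristic can be sketched in a couple of lines (Python used purely for illustration; the address comparison is the whole trick):

```python
# Static BTFN prediction: a branch that jumps backwards (to a lower address)
# is predicted taken, since it is probably a loop-closing branch; forward
# branches are predicted not-taken. Addresses below are illustrative.

def btfn_predict(branch_pc, target_pc):
    """Return True (predict taken) for backward branches."""
    return target_pc < branch_pc
```

A loop-closing branch at 0x1000 jumping back to 0x0F80 is predicted taken; a forward skip to 0x1040 is predicted not-taken.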

Static Branch Prediction Profile-based Instrument program binary Run with representative (?) input set Recompile program Annotate branches with hint bits, or Restructure code to match predict not-taken Best performance: 75-80% accuracy © Shen, Lipasti

Dynamic Branch Prediction Main advantages: learn branch behavior autonomously (no compiler analysis, heuristics, or profiling); adapt to changing branch behavior (program phases change branch behavior). First proposed in 1980: US Patent #4,370,711, "Branch predictor using random access memory," James E. Smith. Continually refined since then. © Shen, Lipasti

Smith Predictor Hardware Jim E. Smith. A Study of Branch Prediction Strategies. International Symposium on Computer Architecture, pages 135-148, May 1981 Widely employed: Intel Pentium, PowerPC 604, PowerPC 620, etc. © Shen, Lipasti
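
A minimal software sketch of the Smith predictor: a table of 2-bit saturating counters indexed by low-order PC bits. Table size and the weakly-not-taken initial state are assumptions for illustration, not details from the slide:

```python
# Smith-style dynamic branch predictor: 2-bit saturating counters,
# indexed by the low-order bits of the branch PC.

class SmithPredictor:
    def __init__(self, entries=1024):
        self.counters = [1] * entries  # start weakly not-taken (0..3 scale)
        self.mask = entries - 1

    def predict(self, pc):
        # Predict taken when the counter is in state 2 or 3.
        return self.counters[pc & self.mask] >= 2

    def update(self, pc, taken):
        # Saturating increment on taken, decrement on not-taken.
        i = pc & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

The 2-bit hysteresis means a single anomalous outcome (e.g. a loop exit) does not flip a strongly-taken prediction.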

Two-level Branch Prediction BHR adds global branch history Provides more context Can differentiate multiple instances of the same static branch Can correlate behavior across multiple static branches © Shen, Lipasti
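
One common realization of two-level prediction is gshare-style XOR indexing; that specific indexing is an assumption here, since the slide only states that a global BHR adds context:

```python
# Two-level (gshare-style) predictor: a global branch history register (BHR)
# is XORed with the PC to index a table of 2-bit saturating counters, so the
# same static branch gets different counters under different global histories.

class TwoLevelPredictor:
    def __init__(self, entries=1024, history_bits=8):
        self.counters = [1] * entries   # weakly not-taken
        self.mask = entries - 1
        self.bhr = 0                    # global history, most recent in LSB
        self.hmask = (1 << history_bits) - 1

    def _index(self, pc):
        return (pc ^ self.bhr) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.counters[i] = min(3, self.counters[i] + 1) if taken else max(0, self.counters[i] - 1)
        self.bhr = ((self.bhr << 1) | int(taken)) & self.hmask
```

A strictly alternating branch (T, NT, T, NT, ...) defeats a plain Smith predictor (~50% accuracy) but becomes perfectly predictable here once the BHR warms up, because the two history patterns select two different counters.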

Combining or Hybrid Predictors Select “best” history Reduce interference w/partial updates Scott McFarling. Combining Branch Predictors. TN-36, Digital Equipment Corporation Western Research Laboratory, June 1993. © Shen, Lipasti

Branch Target Prediction Partial tags sufficient in BTB © Shen, Lipasti

Return Address Stack For each call/return pair: Call: push return address onto hardware stack Return: pop return address from hardware stack © Shen, Lipasti
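
The call/return pairing can be sketched as a small bounded stack; the fixed depth and oldest-entry-discarded overflow policy are assumptions typical of real designs:

```python
# Return address stack (RAS): push the return address on a call,
# pop it to predict the target of the matching return.

class ReturnAddressStack:
    def __init__(self, depth=8):
        self.stack = []
        self.depth = depth

    def on_call(self, return_addr):
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # overflow: discard the oldest entry
        self.stack.append(return_addr)

    def on_return(self):
        # Underflow (more returns than tracked calls): no prediction.
        return self.stack.pop() if self.stack else None
```

Nested calls naturally pop in last-in, first-out order, matching the language-level call/return discipline.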

Branch Speculation Leading Speculation: typically done during the Fetch stage, based on potential branch instruction(s) in the current fetch group. Trailing Confirmation: typically done during the Branch Execute stage, based on the next branch instruction to finish execution. © Shen, Lipasti

Branch Speculation (cont.) Leading Speculation: tag speculative instructions; advance the branch and following instructions; buffer the addresses of speculated branch instructions. Trailing Confirmation: when the branch resolves, remove/deallocate the speculation tag; permit completion of the branch and following instructions. © Shen, Lipasti

Branch Speculation Start new correct path: must remember the alternate (non-predicted) path. Eliminate incorrect path: must ensure that the mis-speculated instructions produce no side effects; "squash" instructions following a NT prediction. © Shen, Lipasti

Mis-speculation Recovery Start new correct path Update PC with computed branch target (if predicted NT) Update PC with sequential instruction address (if predicted T) Can begin speculation again at next branch Eliminate incorrect path Use tag(s) to deallocate resources occupied by speculative instructions Invalidate all instructions in the decode and dispatch buffers, as well as those in reservation stations © Shen, Lipasti

Summary: Instruction Fetch Fetch group alignment Target address generation Branch target buffer Return address stack Target condition generation Static prediction Dynamic prediction Speculative execution Tagging/tracking instructions Recovering from mispredicted branches © Shen, Lipasti

Issues in Decoding Primary tasks: identify individual instructions (!), determine instruction types, determine dependences between instructions. Two important factors: instruction set architecture and pipeline width. RISC vs. CISC: identifying instruction boundaries is inherently serial. Find branches early: redirect fetch. Detect dependences: n×n pairwise comparators. RISC: fixed length, regular format, easier. CISC: can take multiple stages (lots of work); P6: I$ => decode is 5 cycles, often translates into internal RISC-like uops or ROPs. © Shen, Lipasti

Pentium Pro Fetch/Decode 16B/cycle delivered from I$ into FIFO instruction buffer Decoder 0 is fully general, 1 & 2 can handle only simple uops. Enter centralized window (reservation station); wait here until operands ready, structural hazards resolved. Why is this bad? Branch penalty; need a good branch predictor. Other option: predecode bits © Shen, Lipasti

Predecoding in the AMD K5 K5: notoriously late and slow, but still interesting (AMD’s first non-clone x86 processor) ~50% larger I$, predecode bits generated as instructions fetched from memory on a cache miss: Powerful principle in architecture: memoization! Predecode records start and end of x86 ops, # of ROPs, location of opcodes & prefixes Up to 4 ROPs per cycle. Also useful in RISCs: PPC 620 used 7 bits/inst PA8000, MIPS R10000 used 4/5 bits/inst These used to ID branches early, reduce branch penalty © Shen, Lipasti

Dependence Checking [Diagram: each trailing instruction's Src0/Src1 fields are compared (?=) against the Dest fields of all leading instructions in the fetch group] Trailing instructions in the fetch group check for dependences on leading instructions. © Shen, Lipasti
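
The comparator array can be sketched directly: every trailing instruction compares its two sources against the destinations of all leading instructions in the group. The (dest, src0, src1) tuple format is an assumed encoding for illustration:

```python
# Pairwise dependence check within a fetch group: for each instruction j,
# compare its sources against the destination of every earlier instruction i.
# This mirrors the n x n comparator array in hardware.

def check_dependences(group):
    """group: list of (dest, src0, src1) register numbers, in program order.
    Returns a list of (producer, consumer) index pairs (RAW dependences)."""
    deps = []
    for j, (_, s0, s1) in enumerate(group):
        for i in range(j):
            dest = group[i][0]
            if dest in (s0, s1):
                deps.append((i, j))  # instruction j reads what i writes
    return deps
```

For the running example (Add R3<=R2+R1; Sub R4<=R3+R1; And R3<=R4&R2), the check reports that the Sub depends on the Add and the And depends on the Sub.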

Instruction Dispatch and Issue Parallel pipeline Centralized instruction fetch Centralized instruction decode Diversified pipeline Distributed instruction execution © Shen, Lipasti

Necessity of Instruction Dispatch Must have complex interstage buffers to hold instructions to avoid rigid pipeline © Shen, Lipasti

Centralized Reservation Station Dispatch: based on type; Issue: when instruction enters functional unit to execute (same thing here) Centralized: efficient, shared resource; has scaling problems (later) © Shen, Lipasti

Distributed Reservation Station Distributed, with localized control (easy win: break up based on data type, i.e. FP vs. integer). Less efficient utilization, but each unit is smaller since it can be single-ported (for dispatch and issue). Must tune for proper utilization. Must make 1000 little decisions (juggle 100 ping pong balls). © Shen, Lipasti

Issues in Instruction Execution Parallel execution units Bypassing is a real challenge Resolving register data dependences Want out-of-order instruction execution Resolving memory data dependences Want loads to issue as soon as possible Maintaining precise exceptions Required by the ISA © Shen, Lipasti

Bypass Networks O(n^2) interconnect from/to FU inputs and outputs; associative tag-match to find operands. Solutions (hurt IPC, help cycle time): use RF only (IBM Power4) with no bypass network; decompose into clusters (Alpha 21264). Draw bypass between integer/BR/CR units: 4 sources, 12 sinks. © Shen, Lipasti

The Big Picture Instruction processing constraints: resource contention (structural dependences) and code dependences. Code dependences split into control dependences and data dependences; data dependences split into true dependences (RAW) plus anti-dependences (WAR) and output dependences (WAW), the latter two being storage conflicts. What can slow us down? Structural: organization (pipeline, FUs). Control dependences: covered. Data dependences: coming up. © Shen, Lipasti

Register Data Dependences Program data dependences cause hazards True dependences (RAW) Antidependences (WAR) Output dependences (WAW) When are registers read and written? Out of program order! Hence, any/all of these can occur Solution to all three: register renaming © Shen, Lipasti

Register Renaming: WAR/WAW Widely employed (P-4, P-M, Athlon, …) Resolving WAR/WAW: Each register write gets unique “rename register” Writes are committed in program order at Writeback WAR and WAW are not an issue All updates to “architected state” delayed till writeback Writeback stage always later than read stage Reorder Buffer (ROB) enforces in-order writeback Add R3 <= … P32 <= … Sub R4 <= … P33 <= … And R3 <= … P35 <= … © Shen, Lipasti

Register Renaming: RAW In order, at dispatch: source registers are checked to see if "in flight"; the register map table keeps track of this. If not in flight, read from the real register file. If in flight, look up the "rename register" tag (IOU). Then, allocate a new register for the register write. Add R3 <= R2 + R1 → P32 <= P2 + P1. Sub R4 <= R3 + R1 → P33 <= P32 + P1. And R3 <= R4 & R2 → P35 <= P33 & P2. © Shen, Lipasti
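
The rename-at-dispatch step can be sketched with a map table and a free list. The initial mappings (R1→P1, etc.) and the free physical register numbers are chosen to reproduce the slide's P32/P33/P35 example, not taken from a real machine:

```python
# Rename at dispatch: read current source mappings (preserving RAW),
# then give the destination a fresh physical register (killing WAR/WAW).

class Renamer:
    def __init__(self, initial_map, free_list):
        self.map = dict(initial_map)   # architectural -> physical register
        self.free = list(free_list)    # available physical registers

    def rename(self, dest, src0, src1):
        p0, p1 = self.map[src0], self.map[src1]  # look up sources first
        pd = self.free.pop(0)                    # fresh name for the dest
        self.map[dest] = pd                      # later readers see new name
        return pd, p0, p1
```

Note that the sources are looked up before the destination is remapped; reversing that order would corrupt instructions like `And R3 <= R4 & R2` that redefine a register they also read in nearby code.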

Register Renaming: RAW Advance instruction to reservation station Wait for rename register tag to trigger issue Reservation station enables out-of-order issue Newer instructions can bypass stalled instructions © Shen, Lipasti

“Dataflow Engine” for Dynamic Execution [Diagram: Dispatch Buffer allocates Reorder Buffer entries and rename registers, then dispatches to Reservation Stations (Branch, Integer, Integer, Floating-Point, Load/Store); results are forwarded to the Reservation Stations and Dispatch Buffer; Complete writes back from Rename Registers to the Register File] Dispatch: read operands, allocate a rename register, allocate a ROB entry, dispatch to an RS (tag paths from RF and RRF as "if ready"). Reorder Buffer: managed as a queue; maintains the sequential order of all instructions in flight. © Shen, Lipasti

Instruction Processing Steps DISPATCH: Read operands from the Register File (RF) and/or Rename Buffers (RRB); rename the destination register and allocate an RRF entry; allocate a Reorder Buffer (ROB) entry; advance the instruction to the appropriate Reservation Station (RS). EXECUTE: The RS entry monitors the bus for register tag(s) to latch in pending operand(s); when all operands are ready, issue the instruction into the Functional Unit (FU) and deallocate the RS entry (no further stalling in the execution pipe); when execution finishes, broadcast the result to waiting RS entries, the RRB entry, and the ROB entry. COMPLETE: Update the architected register from the RRB entry, deallocate the RRB entry, and if it is a store instruction, advance it to the Store Buffer; deallocate the ROB entry and the instruction is considered architecturally completed. Notes: at Execute, select logic prioritizes among multiple ready instructions; the RS is deallocated at issue (except for loads/stores); may stall on result bus arbitration (if not a dedicated bus). © Shen, Lipasti

Physical Register File Pipeline: Fetch, Decode, Rename, RF Read, Issue, Execute (Agen-D$ / ALU), RF Write (D$ load data written at RF Write). Map table: R0 => P7, R1 => P3, …, R31 => P39. Used in the MIPS R10000 pipeline. All registers in one place. Always accessed right before the EX stage. No copying to a real register file. © Shen, Lipasti

Managing Physical Registers Add R3 <= R2 + R1 → P32 <= P2 + P1. Sub R4 <= R3 + R1 → P33 <= P32 + P1. … And R3 <= R4 & R2 → P35 <= P33 & P2 (release P32, the previous R3, when this instruction completes execution). Map table: R0 => P7, R1 => P3, …, R31 => P39. What to do when all physical registers are in use? Must release them somehow to avoid stalling. Maintain a free list of "unused" physical registers. Release when no more uses are possible; sufficient condition: the next write to the same architectural register commits. © Shen, Lipasti
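
The release rule can be sketched as follows; this is a hypothetical minimal model where each rename remembers the destination's previous mapping and frees it only when the redefining instruction commits (the physical register numbers are illustrative):

```python
# Free-list management: the old physical register behind a renamed
# destination is only returned to the free list when the instruction that
# redefined it commits, since earlier readers may still need the old value.

class FreeListManager:
    def __init__(self, free, initial_map=None):
        self.free = list(free)              # available physical registers
        self.map = dict(initial_map or {})  # architectural -> physical
        self.pending = {}                   # inst id -> old mapping to free

    def rename_dest(self, inst_id, areg):
        # Remember the previous mapping before overwriting it.
        self.pending[inst_id] = self.map.get(areg)
        self.map[areg] = self.free.pop(0)
        return self.map[areg]

    def commit(self, inst_id):
        old = self.pending.pop(inst_id)
        if old is not None:
            self.free.append(old)  # previous physical register is now dead
```

If the instruction is squashed instead of committed, the roles invert: the new mapping is freed and the old one is restored, which is why the old mapping must be tracked until commit.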

Memory Data Dependences [Diagram: Load/Store RS, Agen, Store Queue, Reorder Buffer, Mem] WAR/WAW: stores commit in order, so these hazards are not possible. RAW: loads must check pending stores. The store queue keeps track of pending store addresses; loads check against these addresses, similar to register bypass logic, except the comparators are 32 or 64 bits wide (address size). A major source of complexity in modern designs: store queue lookup is position-based. What if the store address is not yet known? © Shen, Lipasti
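
The load-versus-pending-store check can be sketched with word-granularity addresses and a youngest-matching-store forwarding rule; real designs must also handle partial and overlapping addresses, which this sketch ignores:

```python
# Store queue: stores buffer (addr, value) in program order; a load searches
# from youngest to oldest pending store and forwards on an address match,
# otherwise it reads memory. Stores update memory in order at retirement.

class StoreQueue:
    def __init__(self):
        self.entries = []  # (addr, value) pairs, oldest first

    def add_store(self, addr, value):
        self.entries.append((addr, value))

    def load(self, addr, memory):
        for a, v in reversed(self.entries):  # youngest matching store wins
            if a == addr:
                return v                     # forward from the store queue
        return memory.get(addr, 0)           # no conflict: read cache/memory

    def retire_oldest(self, memory):
        addr, value = self.entries.pop(0)
        memory[addr] = value                 # in-order memory update
```

Searching youngest-first is what makes the lookup position-based: when two pending stores hit the same address, the load must see the most recent one in program order.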

Increasing Memory Bandwidth Duplicating load/store pipelines is fundamentally harder because they contain shared state (in the cache) Loads that miss can be set aside in a “missed load buffer” or “miss status handling register” (MSHR) Subsequent loads that hit can still be handled © Shen, Lipasti

Issues in Completion/Retirement Out-of-order execution of ALU and load/store instructions; in-order completion/retirement for precise exceptions. Solutions: the reorder buffer retires instructions in order; the store queue retires stores in order; exceptions can be handled at any instruction boundary by reconstructing state out of the ROB/SQ. Precise exception: a clean instruction boundary for restart. Memory consistency: WAW ordering, plus subtle multiprocessor issues (in 757). Memory coherence: RAR; expect a later load to also see the new value seen by an earlier load. © Shen, Lipasti

A Dynamic Superscalar Processor ROB – preallocated at dispatch: bookkeeping, store results, forward results (possibly) Complete – commit results to RF (no way to undo) Retire – memory update: delay stores to let loads go early © Shen, Lipasti

Superscalar Summary © Shen, Lipasti

[John DeVale & Bryan Black, 2005] © Shen, Lipasti

Superscalar Summary Instruction flow: branches, jumps, calls (predict target and direction); fetch alignment; instruction cache misses. Register data flow: register renaming (RAW/WAR/WAW). Memory data flow: in-order stores (WAR/WAW); store queue (RAW); data cache misses (missed load buffers). © Shen, Lipasti