ECE/CS 552: Introduction to Superscalar Processors Instructor: Mikko H Lipasti Fall 2010 University of Wisconsin-Madison Lecture notes partially based.

Slides:



Advertisements
Similar presentations
1 Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ.
Advertisements

Pipeline Computer Organization II 1 Hazards Situations that prevent starting the next instruction in the next cycle Structural hazards – A required resource.
Pipeline Hazards Pipeline hazards These are situations that inhibit that the next instruction can be processed in the next stage of the pipeline. This.
Out-of-Order Machine State Instruction Sequence: Inorder State: Look-ahead State: Architectural State: R3  A R7  B R8  C R7  D R4  E R3  F R8  G.
THE MIPS R10000 SUPERSCALAR MICROPROCESSOR Kenneth C. Yeager IEEE Micro in April 1996 Presented by Nitin Gupta.
Superscalar Organization Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti.
Microprocessor Microarchitecture Dependency and OOO Execution Lynn Choi Dept. Of Computer and Electronics Engineering.
Instruction-Level Parallelism (ILP)
Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)
CPE 731 Advanced Computer Architecture ILP: Part IV – Speculative Execution Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.
DATAFLOW ARHITEKTURE. Dataflow Processors - Motivation In basic processor pipelining hazards limit performance –Structural hazards –Data hazards due to.
February 28, 2012CS152, Spring 2012 CS 152 Computer Architecture and Engineering Lecture 11 - Out-of-Order Issue, Register Renaming, & Branch Prediction.
Computer Architecture 2011 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.
1 Lecture 18: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.
1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)
CS 152 Computer Architecture and Engineering Lecture 15 - Advanced Superscalars Krste Asanovic Electrical Engineering and Computer Sciences University.
1 Lecture 8: Branch Prediction, Dynamic ILP Topics: branch prediction, out-of-order processors (Sections )
Trace Caches J. Nelson Amaral. Difficulties to Instruction Fetching Where to fetch the next instruction from? – Use branch prediction Sometimes there.
March 9, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Krste Asanovic Electrical.
EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.
1 Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections , )
1 Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections )
The PowerPC Architecture  IBM, Motorola, and Apple Alliance  Based on the IBM POWER Architecture ­Facilitate parallel execution ­Scale well with advancing.
1 Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections )
1 Lecture 7: Branch prediction Topics: bimodal, global, local branch prediction (Sections )
OOO execution © Avi Mendelson, 4/ MAMAS – Computer Architecture Lecture 7 – Out Of Order (OOO) Avi Mendelson Some of the slides were taken.
Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.
1 Sixth Lecture: Chapter 3: CISC Processors (Tomasulo Scheduling and IBM System 360/91) Please recall:  Multicycle instructions lead to the requirement.
1 Advanced Computer Architecture Dynamic Instruction Level Parallelism Lecture 2.
Dynamic Pipelines. Interstage Buffers Superscalar Pipeline Stages In Program Order In Program Order Out of Order.
1 Lecture 6 Tomasulo Algorithm CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading:Textbook 2.4, 2.5.
CSCE 614 Fall Hardware-Based Speculation As more instruction-level parallelism is exploited, maintaining control dependences becomes an increasing.
1 Lecture 7: Speculative Execution and Recovery Branch prediction and speculative execution, precise interrupt, reorder buffer.
1 Lecture: Out-of-order Processors Topics: a basic out-of-order processor with issue queue, register renaming, and reorder buffer.
Dataflow Order Execution  Use data copying and/or hardware register renaming to eliminate WAR and WAW ­register name refers to a temporary value produced.
ECE/CS 552: Pipeline Hazards © Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim.
Advanced Pipelining 7.1 – 7.5. Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike.
1 Lecture 10: Memory Dependence Detection and Speculation Memory correctness, dynamic memory disambiguation, speculative disambiguation, Alpha Example.
CS203 – Advanced Computer Architecture ILP and Speculation.
ECE/CS 552: Introduction to Superscalar Processors and the MIPS R10000 © Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill,
Lecture: Out-of-order Processors
CS 352H: Computer Systems Architecture
CSL718 : Superscalar Processors
/ Computer Architecture and Design
PowerPC 604 Superscalar Microprocessor
Lecture: Out-of-order Processors
Microprocessor Microarchitecture Dynamic Pipeline
Flow Path Model of Superscalars
Lecture 6: Advanced Pipelines
Out of Order Processors
Superscalar Processors & VLIW Processors
Lecture 10: Out-of-order Processors
Lecture 11: Out-of-order Processors
Lecture: Out-of-order Processors
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
Lecture 11: Memory Data Flow Techniques
Lecture: Out-of-order Processors
Lecture 8: Dynamic ILP Topics: out-of-order processors
Adapted from the slides of Prof
15-740/ Computer Architecture Lecture 5: Precise Exceptions
Krste Asanovic Electrical Engineering and Computer Sciences
Lecture 7: Dynamic Scheduling with Tomasulo Algorithm (Section 2.4)
* From AMD 1996 Publication #18522 Revision E
Adapted from the slides of Prof
Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/30/2011
Dynamic Pipelines Like Wendy’s: once ID/RD has determined what you need, you get queued up, and others behind you can get past you. In-order front end,
Lecture 9: Dynamic ILP Topics: out-of-order processors
Conceptual execution on a processor which exploits ILP
ECE/CS 552: Introduction to Superscalar Processors
Presentation transcript:

ECE/CS 552: Introduction to Superscalar Processors Instructor: Mikko H Lipasti Fall 2010 University of Wisconsin-Madison Lecture notes partially based on notes by John P. Shen

© Shen, Lipasti 2 Limitations of Scalar Pipelines Scalar upper bound on throughput –IPC = 1 Inefficient unified pipeline –Long latency for each instruction Rigid pipeline stall policy –One stalled instruction stalls all newer instructions

© Shen, Lipasti 3 Parallel Pipelines

© Shen, Lipasti 4 Intel Pentium Parallel Pipeline

© Shen, Lipasti 5 Diversified Pipelines

© Shen, Lipasti 6 Power4 Diversified Pipelines PC I-Cache BR Scan BR Predict Fetch Q Decode Reorder Buffer BR/CR Issue Q CR Unit BR Unit FX/LD 1 Issue Q FX1 Unit LD1 Unit FX/LD 2 Issue Q LD2 Unit FX2 Unit FP Issue Q FP1 Unit FP2 Unit StQ D-Cache

© Shen, Lipasti 7 Rigid Pipeline Stall Policy Bypassing of Stalled Instruction Stalled Instruction Backward Propagation of Stalling Not Allowed

© Shen, Lipasti 8 Dynamic Pipelines

© Shen, Lipasti 9 Interstage Buffers

© Shen, Lipasti 10 Superscalar Pipeline Stages In Program Order In Program Order Out of Order

© Shen, Lipasti 11 Limitations of Scalar Pipelines Scalar upper bound on throughput –IPC = 1 –Solution: wide (superscalar) pipeline Inefficient unified pipeline –Long latency for each instruction –Solution: diversified, specialized pipelines Rigid pipeline stall policy –One stalled instruction stalls all newer instructions –Solution: Out-of-order execution, distributed execution pipelines

© Shen, Lipasti 12 Impediments to High IPC

© Shen, Lipasti 13 Superscalar Pipeline Design Instruction Fetching Issues Instruction Decoding Issues Instruction Dispatching Issues Instruction Execution Issues Instruction Completion & Retiring Issues

© Shen, Lipasti 14 Instruction Fetch Challenges: –Branches: control dependences –Branch target misalignment –Instruction cache misses Solutions –Alignment hardware –Prediction/speculation Instruction Memory PC 3 instructions fetched Objective: Fetch multiple instructions per cycle

© Shen, Lipasti 15 Fetch Alignment

© Shen, Lipasti 16 Branches – MIPS 6 Types of Branches Jump (uncond, no save PC, imm) Jump and link (uncond, save PC, imm) Jump register (uncond, no save PC, register) Jump and link register (uncond, save PC, register) Branch (conditional, no save PC, PC+imm) Branch and link (conditional, save PC, PC+imm)

© Shen, Lipasti 17 Disruption of Sequential Control Flow

© Shen, Lipasti 18 Branch Prediction Target address generation  Target Speculation –Access register: PC, General purpose register, Link register –Perform calculation: +/- offset, autoincrement Condition resolution  Condition speculation –Access register: Condition code register, General purpose register –Perform calculation: Comparison of data register(s)

© Shen, Lipasti 19 Target Address Generation

© Shen, Lipasti 20 Condition Resolution

© Shen, Lipasti 21 Branch Instruction Speculation

© Shen, Lipasti 22 Static Branch Prediction Single-direction –Always not-taken: Intel i486 Backwards Taken/Forward Not Taken –Loop-closing branches have negative offset –Used as backup in Pentium Pro, II, III, 4

© Shen, Lipasti 23 Static Branch Prediction Profile-based 1.Instrument program binary 2.Run with representative (?) input set 3.Recompile program a. Annotate branches with hint bits, or b. Restructure code to match predict not-taken Performance: 75-80% accuracy –Much higher for “easy” cases

© Shen, Lipasti 24 Dynamic Branch Prediction Main advantages: –Learn branch behavior autonomously No compiler analysis, heuristics, or profiling –Adapt to changing branch behavior Program phase changes branch behavior First proposed in 1980 –US Patent #4,370,711, Branch predictor using random access memory, James. E. Smith Continually refined since then

© Shen, Lipasti 25 Smith Predictor Hardware Jim E. Smith. A Study of Branch Prediction Strategies. International Symposium on Computer Architecture, pages , May 1981 Widely employed: Intel Pentium, PowerPC 604, PowerPC 620, etc.

© Shen, Lipasti 26 Two-level Branch Prediction BHR adds global branch history –Provides more context –Can differentiate multiple instances of the same static branch –Can correlate behavior across multiple static branches

© Shen, Lipasti 27 Combining or Hybrid Predictors Select “best” history Reduce interference w/partial updates Scott McFarling. Combining Branch Predictors. TN-36, Digital Equipment Corporation Western Research Laboratory, June 1993.

© Shen, Lipasti 28 Branch Target Prediction Partial tags sufficient in BTB

© Shen, Lipasti 29 Return Address Stack For each call/return pair: –Call: push return address onto hardware stack –Return: pop return address from hardware stack

© Shen, Lipasti 30 Branch Speculation Leading Speculation –Typically done during the Fetch stage –Based on potential branch instruction(s) in the current fetch group Trailing Confirmation –Typically done during the Branch Execute stage –Based on the next Branch instruction to finish execution

© Shen, Lipasti 31 Branch Speculation Leading Speculation 1.Tag speculative instructions 2.Advance branch and following instructions 3.Buffer addresses of speculated branch instructions Trailing Confirmation 1.When branch resolves, remove/deallocate speculation tag 2.Permit completion of branch and following instructions

© Shen, Lipasti 32 Branch Speculation Start new correct path –Must remember the alternate (non-predicted) path Eliminate incorrect path –Must ensure that the mis-speculated instructions produce no side effects

© Shen, Lipasti 33 Mis-speculation Recovery Start new correct path 1.Update PC with computed branch target (if predicted NT) 2.Update PC with sequential instruction address (if predicted T) 3.Can begin speculation again at next branch Eliminate incorrect path 1.Use tag(s) to deallocate resources occupied by speculative instructions 2.Invalidate all instructions in the decode and dispatch buffers, as well as those in reservation stations

© Shen, Lipasti 34 Summary: Instruction Fetch Fetch group alignment Target address generation –Branch target buffer –Return address stack Target condition generation –Static prediction –Dynamic prediction Speculative execution –Tagging/tracking instructions –Recovering from mispredicted branches

© Shen, Lipasti 35 Issues in Decoding Primary Tasks –Identify individual instructions (!) –Determine instruction types –Determine dependences between instructions Two important factors –Instruction set architecture –Pipeline width

© Shen, Lipasti 36 Pentium Pro Fetch/Decode

© Shen, Lipasti 37 Predecoding in the AMD K5

© Shen, Lipasti 38 Dependence Checking Trailing instructions in fetch group –Check for dependence on leading instructions DestSrc0Src1DestSrc0Src1DestSrc0Src1DestSrc0Src1 ?=

© Shen, Lipasti 39 Instruction Dispatch and Issue Parallel pipeline –Centralized instruction fetch –Centralized instruction decode Diversified pipeline –Distributed instruction execution

© Shen, Lipasti 40 Necessity of Instruction Dispatch

© Shen, Lipasti 41 Centralized Reservation Station

© Shen, Lipasti 42 Distributed Reservation Station

© Shen, Lipasti 43 Issues in Instruction Execution Parallel execution units –Bypassing is a real challenge Resolving register data dependences –Want out-of-order instruction execution Resolving memory data dependences –Want loads to issue as soon as possible Maintaining precise exceptions –Required by the ISA

© Shen, Lipasti 44 Bypass Networks O(n 2 ) interconnect from/to FU inputs and outputs Associative tag-match to find operands Solutions (hurt IPC, help cycle time) –Use RF only (IBM Power4) with no bypass network –Decompose into clusters (Alpha 21264) PC I-Cache BR Scan BR Predict Fetch Q Decode Reorder Buffer BR/CR Issue Q CR Unit BR Unit FX/LD 1 Issue Q FX1 Unit LD1 Unit FX/LD 2 Issue Q LD2 Unit FX2 Unit FP Issue Q FP1 Unit FP2 Unit StQ D-Cache

© Shen, Lipasti 45 The Big Picture INSTRUCTION PROCESSING CONSTRAINTS Resource ContentionCode Dependences Control Dependences Data Dependences True Dependences Anti-Dependences Output Dependences Storage Conflicts (Structural Dependences) (RAW) (WAR) (WAW)

© Shen, Lipasti 46 Register Data Dependences Program data dependences cause hazards –True dependences (RAW) –Antidependences (WAR) –Output dependences (WAW) When are registers read and written? –Out of program order! –Hence, any/all of these can occur Solution to all three: register renaming

© Shen, Lipasti 47 Register Renaming: WAR/WAW Widely employed (Core i7, Athlon/Phenom, …) Resolving WAR/WAW: –Each register write gets unique “rename register” –Writes are committed in program order at Writeback –WAR and WAW are not an issue All updates to “architected state” delayed till writeback Writeback stage always later than read stage –Reorder Buffer (ROB) enforces in-order writeback Add R3 <= …P32 <= … Sub R4 <= …P33 <= … And R3 <= …P35 <= …

© Shen, Lipasti 48 Register Renaming: RAW In order, at dispatch: –Source registers checked to see if “in flight” Register map table keeps track of this If not in flight, can be read from the register file If in flight, look up “rename register” tag (IOU) –Then, allocate new register for register write Add R3 <= R2 + R1P32 <= P2 + P1 Sub R4 <= R3 + R1P33 <= P32 + P1 And R3 <= R4 & R2P35 <= P33 + P2

© Shen, Lipasti 49 Register Renaming: RAW Advance instruction to reservation station –Wait for rename register tag to trigger issue Reservation station enables out-of-order issue –Newer instructions can bypass stalled instructions

© Shen, Lipasti 50 “Dataflow Engine” for Dynamic Execution Stations Reorder Buffer Branch Reg. FileRen. Reg. Forwarding results to Res. Sta. & Allocate Reorder Buffer entries Reg. Write Back rename Managed as a queue; Maintains sequential order of all Instructions in flight Integer Float.- Load/ Point Store registers

© Shen, Lipasti 51 Instruction Processing Steps DISPATCH: Read operands from Register File (RF) and/or Rename Buffers (RRB) Rename destination register and allocate RRF entry Allocate Reorder Buffer (ROB) entry Advance instruction to appropriate Reservation Station (RS) EXECUTE: RS entry monitors bus for register Tag(s) to latch in pending operand(s) When all operands ready, issue instruction into Functional Unit (FU) and deallocate RS entry (no further stalling in execution pipe) When execution finishes, broadcast result to waiting RS entries, RRB entry, and ROB entry COMPLETE: Update architected register from RRB entry, deallocate RRB entry, and if it is a store instruction, advance it to Store Buffer Deallocate ROB entry and instruction is considered architecturally completed

© Shen, Lipasti 52 Physical Register File Used in MIPS R10000, Pentium 4, AMD Bulldozer All registers in one place –Always accessed right before EX stage –No copying to real register file at commit Fetch Decode Rename Issue RF Read Execute Agen-D$ ALU RF Write D$ Load RF Write Physical Register File Map Table R0 => P7 R1 => P3 … R31 => P39

© Shen, Lipasti 53 Managing Physical Registers What to do when all physical registers are in use? –Must release them somehow to avoid stalling –Maintain free list of “unused” physical registers Release when no more uses are possible –Sufficient: next write commits Map Table R0 => P7 R1 => P3 … R31 => P39 Add R3 <= R2 + R1P32 <= P2 + P1 Sub R4 <= R3 + R1P33 <= P32 + P1 … And R3 <= R4 & R2P35 <= P33 + P2 Release P32 (previous R3) when this instruction completes execution

© Shen, Lipasti 54 Memory Data Dependences WAR/WAW: stores commit in order –Hazards not possible. Why? RAW: loads must check pending stores –Store queue keeps track of pending store addresses –Loads check against these addresses –Similar to register bypass logic –Comparators are 32 or 64 bits wide (address size) Major source of complexity in modern designs –Store queue lookup is position-based –What if store address is not yet known? Store Queue Load/Store RS Agen Reorder Buffer Mem

© Shen, Lipasti 55 Increasing Memory Bandwidth

© Shen, Lipasti 56 Issues in Completion/Retirement Out-of-order execution –ALU instructions –Load/store instructions In-order completion/retirement –Precise exceptions Solutions –Reorder buffer retires instructions in order –Store queue retires stores in order –Exceptions can be handled at any instruction boundary by reconstructing state out of ROB/SQ

© Shen, Lipasti 57 A Dynamic Superscalar Processor

© Shen, Lipasti 58 Superscalar Summary

© Shen, Lipasti 59 [John DeVale & Bryan Black, 2005]

© Shen, Lipasti 60 Superscalar Summary Instruction flow –Branches, jumps, calls: predict target, direction –Fetch alignment –Instruction cache misses Register data flow –Register renaming: RAW/WAR/WAW Memory data flow –In-order stores: WAR/WAW –Store queue: RAW –Data cache misses: missed load buffers