CIS 501: Computer Architecture
Unit 8: Superscalar Pipelines
Slides developed by Joe Devietti, Milo Martin & Amir Roth at UPenn, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood


A Key Theme: Parallelism
- Previously: pipeline-level parallelism
  - Work on the execute of one instruction in parallel with the decode of the next
- Next: instruction-level parallelism (ILP)
  - Execute multiple independent instructions fully in parallel
- Then: static & dynamic scheduling
  - Extract much more ILP
- Data-level parallelism (DLP)
  - Single instruction, multiple data (one insn, four 64-bit adds)
- Thread-level parallelism (TLP)
  - Multiple software threads running on multiple cores

This Unit: (In-Order) Superscalar Pipelines
- The idea of instruction-level parallelism
- Superscalar hardware issues
  - Bypassing and the register file
  - Stall logic
  - Fetch
- "Superscalar" vs. VLIW/EPIC

Readings
- Textbook (MA:FSPTCM)
  - Sections 3.1 and 3.2 (but not the "Sidebar" in 3.2)
  - Sections 4.2 and 4.3

"Scalar" Pipeline & the Flynn Bottleneck
- So far we have looked at scalar pipelines
  - One instruction per stage
  - With control speculation, bypassing, etc.
- Performance limit (aka the "Flynn Bottleneck") is CPI = IPC = 1
  - The limit is never even achieved (hazards)
  - Diminishing returns from "super-pipelining" (hazards + overhead)

An Opportunity…
- But consider:
    ADD r1, r2 -> r3
    ADD r4, r5 -> r6
- Why not execute them at the same time? (We can!)
- What about:
    ADD r1, r2 -> r3
    ADD r4, r3 -> r6
- In this case, a dependence prevents parallel execution
- What about three instructions at a time? Or four instructions at a time?
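The check described above can be sketched in a few lines. This is a minimal illustration (not from the slides): an instruction is modeled as a (sources, destination) pair, and the second instruction can issue alongside the first only if the first's destination does not feed one of its sources, i.e. there is no read-after-write hazard between them.

```python
# Minimal sketch: can a second instruction issue alongside the first?
# An instruction is modeled as (sources, destination). The pair conflicts
# when insn1's destination is one of insn2's sources (read-after-write).
def can_issue_together(insn1, insn2):
    (_, dest1), (srcs2, _) = insn1, insn2
    return dest1 not in srcs2  # ignores structural hazards, WAW, etc.

# ADD r1, r2 -> r3  and  ADD r4, r5 -> r6: independent, can go in parallel
print(can_issue_together((("r1", "r2"), "r3"), (("r4", "r5"), "r6")))  # True
# ADD r1, r2 -> r3  and  ADD r4, r3 -> r6: the r3 dependence serializes them
print(can_issue_together((("r1", "r2"), "r3"), (("r4", "r3"), "r6")))  # False
```

A real issue stage also checks structural hazards and load-use delays, as the following slides discuss; this sketch covers only the register-dependence part.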

What Checking Is Required?
- For two instructions: 2 checks
    ADD src1_1, src2_1 -> dest_1
    ADD src1_2, src2_2 -> dest_2   (2 checks)
- For three instructions: 6 checks
    ADD src1_1, src2_1 -> dest_1
    ADD src1_2, src2_2 -> dest_2   (2 checks)
    ADD src1_3, src2_3 -> dest_3   (4 checks)
- For four instructions: 12 checks
    ADD src1_1, src2_1 -> dest_1
    ADD src1_2, src2_2 -> dest_2   (2 checks)
    ADD src1_3, src2_3 -> dest_3   (4 checks)
    ADD src1_4, src2_4 -> dest_4   (6 checks)
- Plus checking for load-to-use stalls from the prior N loads
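The counts above follow a simple pattern, sketched here as a sanity check: instruction i in an N-wide issue group must compare each of its 2 source registers against the destinations of all i-1 earlier instructions in the group, giving N(N-1) checks in total.

```python
# Each instruction compares its 2 sources against every earlier insn's
# destination, so an N-wide group needs 0 + 2 + 4 + ... = N*(N-1) checks.
def issue_group_checks(n):
    return sum(2 * earlier for earlier in range(n))

for width in (2, 3, 4):
    print(width, issue_group_checks(width))  # 2 -> 2, 3 -> 6, 4 -> 12
```

This quadratic growth is why the slides describe stall/dependence logic as an O(N^2) cost of going wider.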

How do we build such "superscalar" hardware?

Multiple-Issue or "Superscalar" Pipeline
- Overcome the CPI = 1 limit using multiple issue
  - Also called superscalar
  - Two instructions per stage at once, or three, or four, or eight…
- "Instruction-Level Parallelism (ILP)" [Fisher, IEEE TC'81]
- Today, typically "4-wide" (Intel Core i7, AMD Opteron)
  - Some more (Power5 is 5-issue; Itanium is 6-issue)
  - Some less (dual-issue is common for simple cores)

A Typical Dual-Issue Pipeline (1 of 2)
- Fetch an entire 16B or 32B cache block
  - 4 to 8 instructions (assuming a 4-byte average instruction length)
  - Predict a single branch per cycle
- Parallel decode
  - Need to check for conflicting instructions
    - Is the output register of I1 an input register to I2?
  - Other stalls, too (for example, load-use delay)

A Typical Dual-Issue Pipeline (2 of 2)
- Multi-ported register file
  - Larger area, latency, power, cost, complexity
- Multiple execution units
  - Simple adders are easy, but bypass paths are expensive
- Memory unit
  - A single load per cycle (stall at decode) is probably okay for dual issue
  - Alternative: add a read port to the data cache
    - Larger area, latency, power, cost, complexity

Superscalar Implementation Challenges

Superscalar Challenges - Front End
- Superscalar instruction fetch
  - Modest: fetch multiple instructions per cycle
  - Aggressive: buffer instructions and/or predict multiple branches
- Superscalar instruction decode
  - Replicate decoders
- Superscalar instruction issue
  - Determine when instructions can proceed in parallel
  - More complex stall logic: O(N^2) for an N-wide machine
  - Not all combinations of instruction types are possible
- Superscalar register read
  - One port for each register read (4-wide superscalar -> 8 read "ports")
  - Each port needs its own set of address and data wires
  - Latency & area grow with (#ports)^2
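A quick back-of-envelope sketch of the register-read scaling claimed above. The quadratic growth in latency/area with port count is an approximation used for intuition, not a measured circuit model.

```python
# An N-wide machine reads up to 2 source operands per instruction per
# cycle, so it needs 2N register-file read ports. Latency and area are
# taken to grow roughly with (#ports)^2 (approximation, per the slide).
def read_ports(width):
    return 2 * width

def relative_regfile_area(width):
    # normalized to a 1-wide (scalar) pipeline's register file
    return (read_ports(width) / read_ports(1)) ** 2

print(read_ports(4))            # 8 read ports, matching the slide
print(relative_regfile_area(4)) # 16.0: roughly 16x a scalar regfile
```

This is one reason the clustering slides later replicate the register file rather than keep adding ports to one copy.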

Superscalar Challenges - Back End
- Superscalar instruction execution
  - Replicate arithmetic units (but not all of them, e.g. not the integer divider)
  - Perhaps multiple cache ports (slower access, higher energy)
    - Only for 4-wide or larger (why? only ~35% of insns are loads/stores)
- Superscalar bypass paths
  - More possible sources for data values
  - O(P*N^2) for an N-wide machine with execute pipeline depth P
- Superscalar instruction register writeback
  - One write port per instruction that writes a register
  - Example: 4-wide superscalar -> 4 write ports
- Fundamental challenge: the amount of ILP in the program
  - The compiler must schedule code and extract parallelism
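The O(P*N^2) bypass count above can be made concrete with a small sketch. The exact constant depends on the design (here each ALU is assumed to have 2 inputs, each selectable from any of the N results at each of P bypass stages); the point is the quadratic growth in N.

```python
# Bypass-path count for an N-wide machine with P execute/bypass stages
# (e.g. P = 2 for MX + WX bypassing). Assumes 2 inputs per ALU, each of
# which can be fed from any of the N results at any of the P stages.
def bypass_sources_per_input(width, depth):
    return width * depth

def total_bypass_paths(width, depth):
    return 2 * width * bypass_sources_per_input(width, depth)  # 2*P*N^2

print(total_bypass_paths(2, 2))  # dual issue, MX + WX: 16 paths
print(total_bypass_paths(4, 2))  # 4-wide: 64 paths, growing as N^2
```

Doubling the width quadruples the path count, which is why the later slides attack bypass (via clustering) before attacking the dependence cross-check.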

Superscalar Bypass
- N^2 bypass network
  - (N+1)-input muxes at each ALU input
  - N^2 point-to-point connections
  - Routing lengthens wires
  - Heavy capacitive load
- And this is just one bypass stage (MX)!
  - There is also WX bypassing
  - Even more for deeper pipelines
- One of the big problems of superscalar
  - Why? It sits on the critical path of the single-cycle "bypass & execute" loop

Not All N^2 Created Equal
- N^2 bypass vs. N^2 stall logic & dependence cross-check
  - Which is the bigger problem?
- N^2 bypass … by far
  - 64-bit quantities (vs. 5-bit register specifiers)
  - Multiple levels (MX, WX) of bypass (vs. 1 level of stall logic)
  - Must fit in one clock period with the ALU (vs. not)
- The dependence cross-check is not even the 2nd biggest N^2 problem
  - The regfile is also an N^2 problem (think latency, where N is #ports)
  - And it is more serious than the cross-check

Mitigating N^2 Bypass & Register File
- Clustering: mitigates N^2 bypass
  - Group ALUs into K clusters
  - Full bypassing within a cluster
  - Limited bypassing between clusters, with a 1- or 2-cycle delay
  - Can hurt IPC, but allows a faster clock
  - (N/K) + 1 inputs at each mux
  - (N/K)^2 bypass paths in each cluster
- Steering: key to performance
  - Steer dependent insns to the same cluster
- Cluster the register file, too
  - Replicate a register file per cluster
  - All register writes update all replicas
  - Fewer read ports; each file serves only its cluster

Mitigating N^2 RegFile: Clustering++
- Clustering: split the N-wide execution pipeline into K clusters
  - With a centralized register file: 2N read ports and N write ports
- Clustered register file: extend clustering to the register file
  - Replicate the register file (one replica per cluster)
  - Each register file supplies register operands to just its cluster
  - All register writes go to all register files (keeps them in sync)
- Advantage: fewer read ports per register file!
  - K register files, each with 2N/K read ports and N write ports
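The port arithmetic above is easy to tabulate; this sketch just restates the slide's counts for a centralized vs. clustered design.

```python
# Port counts for an N-wide machine: a centralized regfile needs 2N read
# and N write ports; clustering into K replicas cuts read ports per file
# to 2N/K, while every write still goes to all K replicas.
def centralized_ports(n):
    return {"read": 2 * n, "write": n}

def clustered_ports(n, k):
    return {"files": k, "read_per_file": 2 * n // k, "write_per_file": n}

print(centralized_ports(4))   # {'read': 8, 'write': 4}
print(clustered_ports(4, 2))  # 2 files, 4 read ports each, 4 write ports
```

With the area/latency of a regfile growing quadratically in its port count, halving the read ports per file is a large win even though the file is replicated.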

Another Challenge: Superscalar Fetch
- What is involved in fetching multiple instructions per cycle?
- In the same cache block? -> no problem
  - A 64-byte cache block is 16 instructions (~4 bytes per instruction)
  - Favors a larger block size (independent of hit rate)
- What if the next instruction is the last instruction in a block?
  - Fetch only one instruction that cycle
  - Or, some processors allow fetching from 2 consecutive blocks
- What about taken branches?
  - How many instructions can be fetched on average?
  - Average number of instructions per taken branch?
    - Assume 20% branches, 50% taken -> ~10 instructions
- Consider a 5-instruction loop on a 4-issue processor
  - Without smarter fetch, ILP is limited to 2.5 (not 4, which is bad)

Increasing Superscalar Fetch Rate
- Option #1: over-fetch and buffer
  - Add a queue between fetch and decode (18 entries in the Intel Core2)
  - Compensates for cycles that fetch fewer than the maximum instructions
  - "Decouples" the "front end" (fetch) from the "back end" (execute)
- Option #2: "loop stream detector" (Core 2, Core i7)
  - Put the entire loop body into a small cache
    - Core2: 18 macro-ops, up to four taken branches
    - Core i7: 28 micro-ops (avoids re-decoding macro-ops!)
  - Any branch mis-prediction requires a normal re-fetch
- Other options: next-next-block prediction, "trace cache"

Multiple-Issue Implementations
- Statically-scheduled (in-order) superscalar
  - What we've talked about thus far
  + Executes unmodified sequential programs
  - Hardware must figure out what can be done in parallel
  - E.g., Pentium (2-wide), UltraSPARC (4-wide), Alpha (4-wide)
- Very Long Instruction Word (VLIW)
  - Compiler identifies independent instructions; requires a new ISA
  + Hardware can be simple and perhaps lower power
  - E.g., Transmeta Crusoe (4-wide)
- Variant: Explicitly Parallel Instruction Computing (EPIC)
  - A bit more flexible encoding & some hardware to help the compiler
  - E.g., Intel Itanium (6-wide)
- Dynamically-scheduled superscalar (next topic)
  - Hardware extracts more ILP by on-the-fly reordering
  - E.g., Core 2, Core i7 (4-wide), Alpha (4-wide)

Trends in Superscalar Width

Trends in Single-Processor Multiple Issue
- Issue width has saturated at 4-6 for high-performance cores
  - A canceled Alpha design was 8-way issue
  - There is not enough ILP to justify going to wider issue
  - Hardware or compiler scheduling is needed to exploit 4-6 effectively
  - More on this in the next unit
- For high performance-per-watt cores (say, in smartphones)
  - Typically 2-wide superscalar (but increasing each generation)

[Chart: issue width by year for the 486, Pentium, Pentium II, Pentium 4, Itanium, Itanium II, and Core 2]

Multiple Issue Recap
- Multiple issue
  - Exploits instruction-level parallelism (ILP) beyond pipelining
  - Improves IPC, but perhaps at some clock & energy penalty
- 4-6 way issue is about the peak issue width currently justifiable
  - Low-power implementations today are typically 2-wide superscalar
- Problem spots
  - N^2 bypass & register file -> clustering
  - Fetch + branch prediction -> buffering, loop streaming, trace cache
  - N^2 dependence check -> VLIW/EPIC (but it is unclear how key this is)
- Implementations
  - Superscalar vs. VLIW/EPIC

This Unit: (In-Order) Superscalar Pipelines
- The idea of instruction-level parallelism
- Superscalar hardware issues
  - Bypassing and the register file
  - Stall logic
  - Fetch
- "Superscalar" vs. VLIW/EPIC