Superscalar Organization Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti.

Limitations of Scalar Pipelines
Scalar upper bound on throughput
– IPC = 1
Inefficient unified pipeline
– Long latency for each instruction
Rigid pipeline stall policy
– One stalled instruction stalls all newer instructions
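
As a back-of-the-envelope illustration (assumptions mine, not from the slides: equal clock frequency f and equal dynamic instruction count N for both machines), the IPC = 1 ceiling bounds the best-case speedup a wider pipeline can deliver:

```latex
% Illustrative throughput bound, assuming the same f and N on both machines.
\[
  T_{\text{exec}} = \frac{N}{\mathrm{IPC}\cdot f},
  \qquad
  \frac{T_{\text{scalar}}}{T_{\text{superscalar}}}
  = \frac{\mathrm{IPC}_{\text{superscalar}}}{\mathrm{IPC}_{\text{scalar}}}
  \le w \quad \text{for a $w$-wide machine.}
\]
```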

Parallel Pipelines

Spatial Pipeline Unrolling

Pipeline Unrolling - Power
12-stage pipeline: 60% latch power, 25% latch delay + setup, 12% area on latches, 20% leakage power (© Shen, Lipasti)

Intel Pentium Parallel Pipeline

Diversified Pipelines

Power4 Diversified Pipelines
[Block diagram: fetch path (PC, I-Cache, BR Scan, BR Predict, Fetch Q), Decode, Reorder Buffer, then separate issue queues feeding the BR/CR units, the FX/LD clusters (FX1/LD1, FX2/LD2), and the FP units (FP1, FP2), with the store queue (StQ) and D-Cache on the memory side]

Rigid Pipeline Stall Policy
[Figure: a stalled instruction blocks the pipeline; bypassing of the stalled instruction is not allowed, so the stall propagates backward to all newer instructions]

Dynamic Pipelines

Interstage Buffers

Superscalar Pipeline Stages
[Figure: pipeline stages and interstage buffers; the front-end stages operate in program order, the execution stages operate out of order, and completion is again in program order]

Limitations of Scalar Pipelines
Scalar upper bound on throughput
– IPC = 1
– Solution: wide (superscalar) pipeline
Inefficient unified pipeline
– Long latency for each instruction
– Solution: diversified, specialized pipelines
Rigid pipeline stall policy
– One stalled instruction stalls all newer instructions
– Solution: out-of-order execution, distributed execution pipelines

Impediments to High IPC

Superscalar Pipeline Design
– Instruction fetching issues
– Instruction decoding issues
– Instruction dispatching issues
– Instruction execution issues
– Instruction completion & retiring issues

Instruction Flow
Objective: fetch multiple instructions per cycle
Challenges:
– Branches: control dependences
– Branch target misalignment
– Instruction cache misses
Solutions:
– Code alignment (static vs. dynamic)
– Prediction/speculation
[Figure: instruction memory indexed by the PC, delivering multiple (e.g., 3) instructions per fetch]
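
A small C sketch of the alignment problem, under assumed parameters (4-wide fetch, 16 instructions per I-cache line, and fetch that cannot cross a line boundary): a branch target near the end of a line wastes fetch slots.

```c
/* Hedged sketch (parameters assumed, not from the slides): how fetch-group
 * alignment limits the instructions actually delivered per cycle. */
#include <stdio.h>

#define FETCH_WIDTH 4
#define LINE_INSTRS 16   /* instructions per I-cache line (assumed) */

/* Instructions delivered this cycle, given the starting PC expressed in
 * instruction units, assuming fetch cannot cross a cache-line boundary. */
static int instrs_fetched(unsigned pc_index)
{
    unsigned offset_in_line = pc_index % LINE_INSTRS;
    unsigned to_line_end    = LINE_INSTRS - offset_in_line;
    return to_line_end < FETCH_WIDTH ? (int)to_line_end : FETCH_WIDTH;
}

int main(void)
{
    printf("aligned target:    %d instructions\n", instrs_fetched(0));
    printf("misaligned target: %d instructions\n", instrs_fetched(14));
    return 0;
}
```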

I-Cache Organization
[Figure: two I-cache array organizations; the fetch address drives a row decoder and a tag compare, with either one cache line per physical row or one cache line spread across two physical rows]
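
For concreteness, a sketch of how a fetch address is split for the row decoder and tag compare, using an assumed geometry (direct-mapped, 64-byte lines, 256 sets), not the organization in the figure:

```c
/* Illustrative address decomposition for a hypothetical I-cache:
 * direct-mapped, 64-byte lines, 256 sets. */
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64u    /* 6 offset bits */
#define NUM_SETS   256u   /* 8 index bits  */

int main(void)
{
    uint32_t addr   = 0x00401A3Cu;                  /* example fetch PC */
    uint32_t offset = addr & (LINE_BYTES - 1);
    uint32_t index  = (addr / LINE_BYTES) & (NUM_SETS - 1);
    uint32_t tag    = addr / (LINE_BYTES * NUM_SETS);

    /* The row decoder uses 'index' to select the physical row(s);
     * 'tag' is compared against the stored tag to detect a hit. */
    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```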

Fetch Alignment

RIOS-I Fetch Hardware

Issues in Decoding
Primary tasks
– Identify individual instructions (!)
– Determine instruction types
– Determine dependences between instructions
Two important factors
– Instruction set architecture
– Pipeline width
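
A minimal sketch (invented encoding, not a real ISA) of the third task above: flagging a RAW dependence between two instructions of a decode group by comparing destination and source register numbers.

```c
/* Toy dependence check across a decode group: an older instruction's
 * destination matching a younger instruction's source is a RAW hazard. */
#include <stdbool.h>
#include <stdio.h>

typedef struct { int dest, src1, src2; } Instr;

static bool raw(const Instr *older, const Instr *younger)
{
    return older->dest == younger->src1 || older->dest == younger->src2;
}

int main(void)
{
    /* add r3,r1,r2 ; sub r4,r3,r5  -> RAW on r3 */
    Instr group[2] = { {3, 1, 2}, {4, 3, 5} };
    printf("RAW between I0 and I1: %s\n",
           raw(&group[0], &group[1]) ? "yes" : "no");
    return 0;
}
```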

Pentium Pro Fetch/Decode

Predecoding in the AMD K5

Instruction Dispatch and Issue
Parallel pipeline
– Centralized instruction fetch
– Centralized instruction decode
Diversified pipeline
– Distributed instruction execution
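
A hedged sketch, with invented field names, of what a reservation-station entry holds after dispatch and how a result broadcast wakes up waiting operands before issue:

```c
/* Simplified reservation-station entry and tag-match wakeup. */
#include <stdbool.h>

typedef struct {
    bool busy;
    int  op;                 /* opcode                                  */
    int  src_tag[2];         /* producer tags for not-yet-ready sources */
    bool src_ready[2];
    long src_val[2];
} RSEntry;

/* On a result broadcast, every waiting entry compares the broadcast tag
 * against its source tags and captures the value on a match. */
static void wakeup(RSEntry *rs, int n, int result_tag, long result_val)
{
    for (int i = 0; i < n; i++) {
        if (!rs[i].busy) continue;
        for (int s = 0; s < 2; s++) {
            if (!rs[i].src_ready[s] && rs[i].src_tag[s] == result_tag) {
                rs[i].src_val[s]   = result_val;
                rs[i].src_ready[s] = true;
            }
        }
    }
}
```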

Necessity of Instruction Dispatch

Centralized Reservation Station

Distributed Reservation Station

Issues in Instruction Execution
Current trends
– More parallelism → bypassing very challenging
– Deeper pipelines
– More diversity
Functional unit types
– Integer
– Floating point
– Load/store → most difficult to make parallel
– Branch
– Specialized units (media)
Very wide datapaths (256 bits/register or more)

Bypass Networks
– O(n²) interconnect from/to FU inputs and outputs
– Associative tag-match to find operands
Solutions (hurt IPC, help cycle time)
– Use RF only (IBM Power4) with no bypass network
– Decompose into clusters (Alpha 21264)
[Figure: clustered Power4-style pipeline diagram, repeated from the Power4 Diversified Pipelines slide]
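
To make the O(n²) growth concrete (a rough counting model, assuming two operand inputs per functional unit and a forwarding path from every result output to every input):

```c
/* Rough illustration of quadratic bypass-network growth:
 * n FUs, 2 inputs each, each reachable from any of the n outputs. */
#include <stdio.h>

int main(void)
{
    for (int n = 2; n <= 8; n *= 2)
        printf("%d FUs -> %d bypass paths\n", n, 2 * n * n);
    return 0;
}
```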

Specialized units
Intel Pentium 4 staggered adders ("Fireball")
– Run at 2x clock frequency
– Two 16-bit bitslices
– Dependent ops execute on half-cycle boundaries
– Full result not available until full cycle later
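
A behavioral sketch of the staggered-add idea (a model of the dataflow only, not the actual circuit): the low 16-bit slice is produced first, so a dependent low-half operation can begin before the high half and carry are resolved a half-cycle later.

```c
/* Behavioral model of a 32-bit add computed as two 16-bit bitslices. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t a = 0x0001FFFFu, b = 0x00000001u;

    uint32_t lo    = (a & 0xFFFFu) + (b & 0xFFFFu);   /* first half-cycle */
    uint32_t carry = lo >> 16;
    uint32_t hi    = (a >> 16) + (b >> 16) + carry;   /* next half-cycle  */
    uint32_t sum   = (hi << 16) | (lo & 0xFFFFu);

    printf("0x%08X\n", sum);   /* same as a + b */
    return 0;
}
```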

Specialized units
FP multiply-accumulate: R = (A x B) + C
– Doubles FLOPs/instruction
– Loses RISC instruction format symmetry: 3 source operands
– Widely used
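
The same operation is exposed to C programs as the standard fma() function from C99's <math.h>, which, like the hardware instruction, applies a single rounding to (A x B) + C:

```c
/* Fused multiply-accumulate R = (A * B) + C via the C99 standard fma(). */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double a = 1.5, b = 2.0, c = 0.25;
    printf("fma = %f\n", fma(a, b, c));   /* 3.25 */
    return 0;
}
```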

Media Data Types
Subword parallel vector extensions
– Media data (pixels, quantized data) often 1-2 bytes
– Several operands packed in a single 32/64b register: {a,b,c,d} and {e,f,g,h} stored in two 32b registers
– Vector instructions operate on 4/8 operands in parallel
– New instructions, e.g. sum of abs. differences (SAD): |a – e| + |b – f| + |c – g| + |d – h|
Substantial throughput improvement
– Usually requires hand-coding of critical loops
– Shuffle ops (gather/scatter of vector elements)
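
A scalar reference version of the SAD computation above, over the four bytes packed into each 32-bit register; a subword-parallel SAD instruction produces the same result in a single operation instead of this loop.

```c
/* Scalar reference for sum of absolute differences over 4 packed bytes. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint32_t sad4(uint32_t x, uint32_t y)
{
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++) {
        int a = (x >> (8 * i)) & 0xFF;
        int b = (y >> (8 * i)) & 0xFF;
        sum += (uint32_t)abs(a - b);
    }
    return sum;
}

int main(void)
{
    uint32_t abcd = 0x0A141E28u;   /* {a,b,c,d} packed */
    uint32_t efgh = 0x0F101D2Cu;   /* {e,f,g,h} packed */
    printf("SAD = %u\n", sad4(abcd, efgh));
    return 0;
}
```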

Issues in Completion/Retirement
Out-of-order execution
– ALU instructions
– Load/store instructions
In-order completion/retirement
– Precise exceptions
– Memory coherence and consistency
Solutions
– Reorder buffer
– Store buffer
– Load queue snooping (later)
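
A minimal sketch (assumed structure, not any particular machine's ROB) of how a reorder buffer gives in-order retirement: instructions may finish execution out of order, but retirement always pops the oldest entry, and an exception marked at the head allows a precise flush of everything younger.

```c
/* Circular reorder buffer with in-order, width-limited retirement. */
#include <stdbool.h>

#define ROB_SIZE 64

typedef struct { bool valid, done, exception; int dest_reg; long result; } ROBEntry;

typedef struct {
    ROBEntry entry[ROB_SIZE];
    int head, tail, count;
} ROB;

/* Retire at most 'width' completed instructions per cycle, oldest first. */
static int retire(ROB *rob, int width)
{
    int retired = 0;
    while (retired < width && rob->count > 0 &&
           rob->entry[rob->head].done && !rob->entry[rob->head].exception) {
        /* commit architectural state for entry[head] here */
        rob->entry[rob->head].valid = false;
        rob->head = (rob->head + 1) % ROB_SIZE;
        rob->count--;
        retired++;
    }
    return retired;
}
```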

A Dynamic Superscalar Processor

Impediments to High IPC

Superscalar Overview
Instruction flow
– Branches, jumps, calls: predict target, direction
– Fetch alignment
– Instruction cache misses
Register data flow
– Register renaming: RAW/WAR/WAW
Memory data flow
– In-order stores: WAR/WAW
– Store queue: RAW
– Data cache misses
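
A hedged sketch of the register-renaming step in the register data flow: each architectural destination is given a fresh physical register, so WAR and WAW name conflicts disappear and only true RAW dependences remain (free-list recycling is omitted for brevity).

```c
/* Toy rename stage: map architectural registers to physical registers. */
#define ARCH_REGS 32
#define PHYS_REGS 128

typedef struct {
    int map[ARCH_REGS];   /* architectural -> physical            */
    int next_free;        /* toy allocator instead of a free list */
} RenameTable;

typedef struct { int dest, src1, src2; } Instr;

static void rename_instr(RenameTable *rt, const Instr *in,
                         int *pdest, int *psrc1, int *psrc2)
{
    *psrc1 = rt->map[in->src1];     /* read current mappings (true RAW)     */
    *psrc2 = rt->map[in->src2];
    *pdest = rt->next_free++;       /* fresh physical dest removes WAW/WAR  */
    rt->map[in->dest] = *pdest;
}
```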