Complexity-Effective Superscalar Processors S. Palacharla, N. P. Jouppi, and J. E. Smith Presented by: Jason Zebchuk

Brainiac vs. Speed Demon
Brainiac: maximize the number of instructions per cycle; more complex logic, slower clock.
Speed Demon: maximize clock frequency; faster clock, fewer instructions per cycle.

Complexity-Effective
Issue width vs. clock cycle: the goal is to balance both.
The best design is complexity-effective: it allows a complex issue scheme AND a fast clock cycle.

What is Complexity? The delay of the critical path through a piece of logic.

Creating a Complexity-Effective Architecture
1. Analyze the complexity of each pipeline stage.
2. Select the component with the most complexity.
3. Propose a less-complex alternative that achieves similar performance.

Agenda
OoO superscalar pipeline overview
Register renaming
Wakeup & select
Bypass
Complexity-effective optimizations: dependence-based scheduler, clustering

Pipeline Overview (Fetch → Decode → Rename → Issue Window (Wakeup/Select) → Register File → Bypass/Execute → Data Cache)
Fetch: read instructions from the I-cache and predict branches.

Pipeline Overview
Decode: parse instructions and shuffle their fields to the appropriate ports for rename.

Pipeline Overview
Rename: map architectural registers to physical registers, eliminate false dependences, and dispatch renamed instructions to the scheduler.

Pipeline Overview
Wakeup: instructions in the issue window check whether they have become ready by comparing register tags broadcast from the writeback stage.
Select: choose among the ready instructions.

Pipeline Overview
Register File Read: read source operands.

Pipeline Overview
Bypass and Execute: execute instructions in the functional units and bypass results from outputs back to inputs.

Pipeline Overview
Data Cache Access: loads and stores access the data cache.

Pipeline Overview
Writeback: write the result to the register file and broadcast the tag to wake up waiting instructions; the tag broadcast happens two cycles before the result is produced.

Alternate Pipeline
Used by the Pentium Pro and PowerPC: a reorder buffer (ROB) holds values, with Tomasulo-like renaming (operands point to ROB entries); reservation stations hold waiting instructions for wakeup and select.

Sources of Complexity
Complexity scales with issue width and window size.

What about Fetch, Decode, Register File, Cache?
Previously studied: register file, caches.
Easy to scale(?): fetch, decode, functional units.

Register Renaming
Eliminates WAR and WAW hazards.
Architectural (logical) registers:
Ld r1, [r2]
Add r3, r1, r2
Sub r1, r2, r4
Ld r2, [r1]
Physical registers (after renaming):
Ld p5, [p7]
Add p3, p5, p7
Sub p6, p7, p4
Ld p9, [p6]
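To make the mapping concrete, here is a minimal Python sketch of a register alias table performing this kind of renaming; the initial mappings, free-list size, and the particular physical names it hands out are illustrative, not the paper's.

```python
# Minimal sketch (not the paper's circuit): a register alias table (RAT)
# that renames architectural registers to physical registers, removing
# WAR/WAW hazards by giving every destination a fresh physical name.

free_list = [f"p{i}" for i in range(5, 33)]            # free physical regs
rat = {"r1": "p1", "r2": "p2", "r3": "p3", "r4": "p4"} # arch -> phys

def rename(dst, srcs):
    """Rename one instruction: read source mappings, allocate a new dest."""
    phys_srcs = [rat[s] for s in srcs]   # sources use current mappings
    phys_dst = free_list.pop(0)          # fresh name removes WAR/WAW hazards
    rat[dst] = phys_dst
    return phys_dst, phys_srcs

# The slide's example sequence (dest, sources):
program = [("r1", ["r2"]), ("r3", ["r1", "r2"]),
           ("r1", ["r2", "r4"]), ("r2", ["r1"])]
for dst, srcs in program:
    print(rename(dst, srcs))
```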

Register Rename Logic
Logical source and destination registers feed the Register Alias Table (an SRAM array) and the dependence check logic; a MUX produces the physical source and destination registers.
Scales with issue width.

Register Alias Table
Indexed by logical register name, returns the physical register name (decoder, wordlines, bitlines, sense amps).
T_rename = T_decode + T_wordline + T_bitline + T_senseamp

Rename Delay
Scales linearly with issue width.

Instruction Wakeup
Broadcast the tags of newly available operands.
Update which operands are ready for each waiting instruction.
Mark an instruction ready when both its left and right operands are ready.
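A behavioral sketch of what wakeup does (Python, not the CAM circuitry the paper models); the class and field names are my own.

```python
# Minimal sketch of wakeup: each window entry holds two source tags and two
# ready bits; a broadcast tag that matches a source sets its ready bit, and
# the entry raises a request to the select logic once both bits are set.

class WindowEntry:
    def __init__(self, op, src_l, src_r, ready_l=False, ready_r=False):
        self.op, self.src_l, self.src_r = op, src_l, src_r
        self.ready_l, self.ready_r = ready_l, ready_r

    def wakeup(self, tag):
        if tag == self.src_l: self.ready_l = True
        if tag == self.src_r: self.ready_r = True

    def request(self):                  # request signal to the select logic
        return self.ready_l and self.ready_r

window = [WindowEntry("add", "p5", "p7", ready_r=True),
          WindowEntry("sub", "p5", "p6")]

for tag in ("p5",):                     # tags broadcast at writeback
    for entry in window:
        entry.wakeup(tag)
print([e.request() for e in window])    # -> [True, False]
```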

Wakeup Logic
Delay = T_tagdrive + T_tagmatch + T_matchOR (tag drive, tag match, match OR).

Wakeup Delay
Delay increases quadratically with window size.

Wakeup Delay Breakdown
Wire delay dominates at smaller feature sizes!

Select
Consider all instructions that are ready.
Select one ready instruction for each functional unit.
Uses some selection policy (e.g., oldest first); the policy has little effect on performance.

Selection Logic
Delay = c0 + c1 × log4(WINSIZE)
Request signals travel up a tree of arbiters; grant signals travel back down.
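To illustrate why delay grows with log4 of the window size, here is a rough Python model of a 4-ary select tree; it uses simple position priority rather than oldest-first, and the structure is an assumption for illustration, not the paper's exact circuit.

```python
# Minimal sketch of a select tree: requests are grouped four at a time, an
# "any request" signal travels up the tree, and a single grant travels back
# down; tree depth (and hence delay) grows as log4 of the window size.

def select_one(requests, lo=0, hi=None):
    """Return the index of one asserted request, or None if there is none."""
    if hi is None:
        hi = len(requests)
    if hi - lo <= 4:                          # leaf arbiter cell
        for i in range(lo, hi):
            if requests[i]:
                return i
        return None
    step = (hi - lo + 3) // 4                 # four children per tree node
    for start in range(lo, hi, step):
        winner = select_one(requests, start, min(start + step, hi))
        if winner is not None:                # grant the first child with a request
            return winner
    return None

requests = [False, True, False, True] + [False] * 12   # 16-entry window
print(select_one(requests))                             # -> 1
```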

Selection Delay
Logic delay only; wire delay is ignored.

Data Bypass Network
Forwards results from completing instructions to dependent instructions.
The number of paths depends on pipeline depth and issue width: # of paths = 2 × IW² × S, where S is the number of pipe stages after results are produced.
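Plugging in some illustrative values (my own, not the paper's configurations) shows how quickly the network grows:

```latex
% paths = 2 * IW^2 * S, evaluated for two illustrative machines
\text{paths} = 2 \times IW^{2} \times S
\qquad
IW = 4,\ S = 1 \;\Rightarrow\; 2 \times 16 \times 1 = 32
\qquad
IW = 8,\ S = 2 \;\Rightarrow\; 2 \times 64 \times 2 = 256
```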

Data Bypass Logic
Functional units broadcast their results (potentially several results from subsequent cycles).
The register file provides the current operand values; a MUX selects the correct source.

Bypass Delay
T_bypass = 0.5 × R_metal × C_metal × L², where L is the length of the result wires.
The delay is independent of feature size but depends on the specific layout.
(The slide tabulates wire length (λ) and delay (ps) versus issue width.)
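Because the delay is quadratic in wire length, layout matters a great deal; the scaling below follows directly from the formula above.

```latex
T_{\text{bypass}} = \tfrac{1}{2}\, R_{\text{metal}}\, C_{\text{metal}}\, L^{2}
\qquad\Rightarrow\qquad
\frac{T_{\text{bypass}}(2L)}{T_{\text{bypass}}(L)} = \frac{(2L)^{2}}{L^{2}} = 4
```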

Putting it All Together
Table: rename delay (ps), wakeup+select delay (ps), and bypass delay (ps) versus issue width and window size, at 0.8μm, 0.35μm, and 0.18μm feature sizes.

Key Sources of Complexity
Wakeup+select logic: the limiting stage for 5 out of 6 designs; it considers all window instructions simultaneously.
Bypass paths: wire-dominated delay; lots of long wires.

Dependence-based Microarchitecture
Replace the issue window with a few small FIFO queues.
Only schedule from the heads of the FIFOs.
Adds a new Steering stage between Rename and the FIFOs.

Instruction Steering Heuristic
Three possible cases for instruction I (a sketch follows below):
1. All operands are ready and in the register file: I is steered to an empty FIFO.
2. I requires an operand produced by I_source, which is in FIFO F_a: if no instruction is behind I_source, I is steered to F_a; otherwise, I is steered to an empty FIFO.
3. Two operands, produced by I_left and I_right: apply rule 2 to I_left; if that steers I to an empty FIFO, apply rule 2 to I_right.
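A minimal Python sketch of this heuristic, assuming a simplified model in which each FIFO is a list of instruction ids and a table records which FIFO slot will produce each pending register; the function and variable names are illustrative, and a real implementation would stall when no suitable FIFO exists rather than fall back to FIFO 0.

```python
# Minimal sketch of the dependence-based steering heuristic. fifos[i] holds
# instruction ids; producer maps a not-yet-produced register to the
# (fifo index, position) of the instruction that will produce it.

def steer(instr, fifos, producer):
    """Pick a FIFO for instr = (id, dest, [srcs]) and append it there."""
    iid, dest, srcs = instr
    pending = [producer[s] for s in srcs if s in producer]

    def empty_fifo():
        # First empty FIFO; falls back to 0 (a real design would stall).
        return next((i for i, f in enumerate(fifos) if not f), 0)

    target = empty_fifo()                    # case 1, or the "otherwise" branch
    for f_idx, pos in pending:               # cases 2/3: left producer first
        if pos == len(fifos[f_idx]) - 1:     # producer is at the FIFO tail
            target = f_idx                   # sit directly behind it
            break

    fifos[target].append(iid)
    producer[dest] = (target, len(fifos[target]) - 1)
    return target

fifos = [[], [], [], []]
producer = {}
print(steer((0, "r1", ["r2"]), fifos, producer))   # -> 0 (empty FIFO)
print(steer((1, "r3", ["r1"]), fifos, producer))   # -> 0 (behind producer of r1)
```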

Steering Example
0: addu r18, r0, r2
1: addiu r2, r0, -1
2: beq r18, r2, L2
3: lw r4, (r28)
4: sllv r2, r18, r20
5: xor r16, r2, r19
6: lw r3, (r28)
7: sll r2, r16, 0x2
8: addu r2, r2, r23
9: lw r2, 0(r2)
10: sllv r4, r18, r4
11: addu r17, r4, r19
12: addiu r3, r3, 1
13: sw r3, (r28)
14: beq r2, r17, L

Performance Comparison
Cycle count is within 8% for all benchmarks.

But...
T_execution = (# instructions / instructions per cycle) × clock period
The dependence-based microarchitecture is less complex and allows a faster clock; overall, it has better performance.

Improving Bypass Paths
One-cycle bypass within a cluster; two-cycle bypass between clusters.

Performance of Clustering
IPC is up to 12% lower, but the clock is 25% faster.
Overall: 10% - 22% faster, 16% faster on average.
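Using the slide's own figures, the low end of that range follows directly from the execution-time equation shown earlier: a 12% IPC loss combined with a 25% faster clock gives roughly a 10% speedup.

```latex
\text{speedup} \;=\; \frac{IPC_{\text{new}}}{IPC_{\text{old}}} \times \frac{f_{\text{new}}}{f_{\text{old}}}
\;=\; 0.88 \times 1.25 \;\approx\; 1.10
```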

Other Alternatives
1. 1 window + 1 cluster
2. 1 window + 2 clusters
3. 2 windows + 2 clusters
i. Issue window + steering heuristic
ii. Issue window + random steering
iii. FIFOs + steering heuristic

Performance of Clustering
Steering has the largest impact on performance.

Conclusions
Wakeup+select and the bypass paths are likely to be the limiting pipeline stages; their atomicity makes them critical.
Consider more complexity-effective designs for these stages: sacrifice some IPC for a higher clock frequency.