CS203 – Advanced Computer Architecture

CS203 – Advanced Computer Architecture More Pipelining

Review: 5-stage MIPS Pipeline

Stages: F D X M W
  F: instruction fetch
  D: decode & read registers
  X: execute
  M: memory
  W: write-back

Assume
  All ALU ops take 1 cycle
  All memory accesses take 1 cycle
  Branches are resolved in the D stage

SAXPY (Y = a*X + Y)

    int i; float X[N], Y[N];
    for (i = 0; i < N; i++) { Y[i] += a * X[i]; }

    Loop: l.s    f1, 0(r2)      // &X in r2
          l.s    f2, 0(r3)      // &Y in r3
          mult.s f1, f1, f0     // a in f0
          addi   r2, r2, 4
          add.s  f2, f2, f1
          addi   r3, r3, 4
          bneq   r2, r1, Loop   // N+4 in r1
          s.s    f2, -4(r3)     // in br delay slot

No stalls, 8 cycles per iteration

How to improve the basic pipeline?

  CPUtime = InstCnt * CPI * ClkCycleTime
  CPI = ideal CPI + stalls per instruction

Ideas?
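The two equations above can be tried out numerically. This is a minimal sketch; the workload numbers (one million instructions, a 1 ns clock, 0.5 stall cycles per instruction) are hypothetical and chosen only for illustration, and the function name `cpu_time` is not from the slides.

```python
def cpu_time(inst_count, ideal_cpi, stalls_per_inst, clock_cycle_time):
    """CPUtime = InstCnt * CPI * ClkCycleTime, with CPI = ideal CPI + stalls."""
    cpi = ideal_cpi + stalls_per_inst
    return inst_count * cpi * clock_cycle_time

# Hypothetical baseline: 1e6 instructions, ideal CPI 1, 0.5 stalls/inst, 1 ns clock.
baseline = cpu_time(1_000_000, 1.0, 0.5, 1e-9)

# Three levers suggested by the equation (all numbers are assumptions):
fewer_stalls = cpu_time(1_000_000, 1.0, 0.25, 1e-9)   # remove half of the stalls
wider_issue = cpu_time(1_000_000, 0.5, 0.5, 1e-9)     # halve the ideal CPI
faster_clock = cpu_time(1_000_000, 1.0, 0.5, 0.5e-9)  # halve the clock cycle time

for name, t in [("fewer stalls", fewer_stalls),
                ("wider issue", wider_issue),
                ("faster clock", faster_clock)]:
    print(f"{name}: speedup {baseline / t:.2f}x")
```

The sketch treats the three terms as independent, which real designs do not allow: a shorter clock cycle usually raises the stall count per instruction.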

Wide Pipelines (Superscalar)

N instructions each clock cycle
  ideal CPI = 1/N

Resources needed
  wider path to I$
  multi-ported register file
  detect dependencies & implement forwarding

[Figure: three parallel F D X M W pipelines]

Wide Pipelines (2): Issues

Simplify: one integer and one floating-point instruction per cycle
  separate register files, no forwarding between them
  load/store are integer

Issues
  branch hazards & delay slots
  forwarding

SAXPY

    Integer slot           FP slot
    l.s  f1, 0(r2)         -
    l.s  f2, 0(r3)         -
    addi r2, r2, 4         mult.s f1, f1, f0
    addi r3, r3, 4         add.s  f2, f2, f1
    bneq r2, r1, Loop      -
    s.s  f2, -4(r3)        -

6 cycles per iteration, 33% better

[Figure: two parallel F D X M W pipelines]
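The "33% better" figure can be checked by counting issue slots. This is a small sketch that encodes the dual-issue SAXPY schedule as (integer slot, FP slot) pairs; the variable names are illustrative only.

```python
# Each tuple is one cycle: (integer/load-store slot, floating-point slot).
schedule = [
    ("l.s f1, 0(r2)", None),
    ("l.s f2, 0(r3)", None),
    ("addi r2, r2, 4", "mult.s f1, f1, f0"),
    ("addi r3, r3, 4", "add.s f2, f2, f1"),
    ("bneq r2, r1, Loop", None),
    ("s.s f2, -4(r3)", None),
]

dual_issue_cycles = len(schedule)
instructions = sum(1 for pair in schedule for slot in pair if slot is not None)
single_issue_cycles = instructions      # scalar pipeline: no stalls, 1 per cycle

speedup = single_issue_cycles / dual_issue_cycles
achieved_cpi = dual_issue_cycles / instructions
ideal_cpi = 1 / 2                       # ideal CPI = 1/N with N = 2

print(f"speedup {speedup:.2f}x, CPI {achieved_cpi:.2f} (ideal {ideal_cpi:.2f})")
```

Only two of the six cycles fill both slots, so the achieved CPI of 0.75 stays well above the ideal 0.5: loads and stores all compete for the integer slot.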

Deep Pipelines

Deeper pipeline, smaller clock cycle time (CCT)
  ideal CCT = 1/k of the unpipelined latency, for k stages

Motivations for deep pipelines
  variable latencies of ALU, FPU, cache, etc.
  CCT = max{all latencies}

Intel's pipelines
  P5: 5 stages, Pentium, < 500 MHz
  P6: 12 stages, Pentium II, III & M, > 2 GHz
  Netburst: 20 stages, Pentium 4, > 3 GHz

Limits to Pipelining

Cost/performance tradeoffs (Peter Kogge, 1981)
  Non-pipelined: let T be the latency and C the logic area cost
  Pipelined with k stages: d is the latch delay, p is the clock period, p = T/k + d
    pipelined frequency f = 1/p
    pipelined area cost = C + k*h (h is the latch area cost)
  Performance/cost ratio: PCR = f / (C + k*h) = 1 / ((T/k + d) * (C + k*h))
  PCR is maximal at k0 = sqrt(T*C / (d*h)), the optimum number of pipeline stages
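The optimum can be located numerically. This sketch takes PCR to be frequency divided by area cost, f / (C + k*h), which follows from the definitions on this slide; setting the derivative with respect to k to zero gives k0 = sqrt(T*C / (d*h)). The parameter values below are hypothetical.

```python
import math

def pcr(k, T, C, d, h):
    """Performance/cost ratio of a k-stage pipeline: frequency / area cost."""
    period = T / k + d            # p = T/k + d
    frequency = 1.0 / period      # f = 1/p
    cost = C + k * h              # logic area plus k latches
    return frequency / cost

def optimum_stages(T, C, d, h):
    """Closed-form maximiser of pcr() over real-valued k."""
    return math.sqrt((T * C) / (d * h))

# Hypothetical parameters, in arbitrary time and area units.
T, C, d, h = 64.0, 100.0, 2.0, 2.0

best_k = max(range(1, 101), key=lambda k: pcr(k, T, C, d, h))
k0 = optimum_stages(T, C, d, h)
print(f"brute force: k = {best_k}, closed form: k0 = {k0:.1f}")
```

A larger latch delay d or latch area h pushes k0 down, while a longer unpipelined latency T or more logic C pushes it up.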

Limits to Pipelining (2)

Overhead introduced at each pipeline stage
  pipeline latches
  uneven distribution of work per stage
  clock skew: the clock may take longer to arrive at some stages than at others

Eventually overhead dominates: diminishing returns

k-stage pipeline where
  overhead per stage is d (time)
  instructions are spaced by S (time), i.e. S*k/T stall cycles per instruction

  CCT = T/k + d
  CPI = ideal CPI + stalls = 1 + S*k/T
  CPU time = (1 + S*k/T) * (T/k + d)

T = 60, d = 2, S = 10
  k = 5:  CPUtime = 25.67
  k = 10: CPUtime = 21.33
  k = 15: CPUtime = 21.00
  k = 20: CPUtime = 21.67
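The table can be reproduced directly from the formula CPU time = (1 + S*k/T) * (T/k + d). This is a minimal sketch using the slide's values T = 60, d = 2, S = 10; the function name and the search range for k are assumptions made for illustration.

```python
def cpu_time_per_instruction(k, T, d, S):
    """Time per instruction for a k-stage pipeline with per-stage overhead d."""
    cct = T / k + d          # clock cycle time
    cpi = 1 + S * k / T      # ideal CPI of 1 plus stall cycles
    return cpi * cct

T, d, S = 60, 2, 10
for k in (5, 10, 15, 20):
    print(f"k = {k:2d}: CPU time = {cpu_time_per_instruction(k, T, d, S):.2f}")

# Search integer depths for the minimum (the range 1..60 is arbitrary).
best_k = min(range(1, 61), key=lambda k: cpu_time_per_instruction(k, T, d, S))
print("best integer k:", best_k)
```

Expanding the product gives T/k + d + S + S*d*k/T, so the first term shrinks with depth while the last grows with it. That is why CPU time flattens out between k = 10 and k = 15 and rises again by k = 20.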

MIPS R4000 pipeline

MIPS R4000 Performance

Pipelined CPU speedup

For a simple RISC pipeline, CPI = 1: [equation not preserved in the transcript]

Multiple pipelines

In-order execution
  in-order issue and completion of instructions

[Figure: pipeline with stages F, D, R and W and parallel functional units labeled integer, fp add, fp mult, DIV and ld/st (stage labels X, X1 X2 X3, M1 M2)]

Multiple pipelines - Tomasulo

In-order issue
Out-of-order completion
  register renaming through reservation stations and the common data bus (CDB): eliminates WAW & WAR hazards
  dynamic loop unrolling: loop-level parallelism

[Figure: pipeline with stages F, D, R and W, parallel functional units labeled integer, fp add, fp mult, DIV and ld/st (stage labels X, X1 X2 X3, M1 M2), and a COMMON DATA BUS]
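How renaming removes WAW and WAR hazards can be seen in a few lines. This is a simplified sketch, not Tomasulo's algorithm itself: each destination gets a fresh tag (standing in for a reservation station ID), and sources read the most recent tag for their register. The instruction encoding as (dest, src1, src2) tuples and the `t0, t1, ...` tag names are assumptions made for illustration.

```python
def rename(instructions):
    """Rename (dest, src1, src2) triples so every write gets a unique tag."""
    latest = {}        # architectural register -> tag of its latest producer
    renamed = []
    for number, (dest, *sources) in enumerate(instructions):
        # Sources use the newest producer's tag, or the register itself
        # if no earlier instruction in this window writes it.
        new_sources = tuple(latest.get(reg, reg) for reg in sources)
        tag = f"t{number}"
        latest[dest] = tag
        renamed.append((tag, *new_sources))
    return renamed

code = [
    ("f2", "f1", "f0"),   # f2 = f1 * f0
    ("f4", "f2", "f3"),   # f4 = f2 + f3   (RAW on f2: a true dependence)
    ("f2", "f5", "f6"),   # f2 = f5 + f6   (WAW with #0, WAR with #1)
    ("f7", "f2", "f4"),   # f7 = f2 * f4   (must see the new f2)
]

for before, after in zip(code, rename(code)):
    print(before, "->", after)
```

After renaming, the third instruction writes `t2` rather than `f2`, so it no longer conflicts with the first write or with the second instruction's read. Only the true (RAW) dependences remain, expressed through the tags.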

Multiple pipelines - ROB

In-order issue
Out-of-order completion
In-order commit: supports speculation through branch prediction
  ROB eliminates the CDB bottleneck
  separates the completion and commit stages

[Figure: pipeline with stages F, D, R and W, parallel functional units labeled integer, fp add, fp mult, DIV and ld/st (stage labels X, X1 X2 X3, M1 M2), and a ROB]
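The split between completion and commit can be sketched with a queue. This is a toy model under simplifying assumptions (no register values, no branch misprediction squash); the class and method names are hypothetical and not taken from the slides.

```python
from collections import deque

class ReorderBuffer:
    """Toy ROB: entries enter in program order and leave in program order."""

    def __init__(self):
        self.entries = deque()          # oldest instruction at the left

    def issue(self, name):
        entry = {"name": name, "done": False}
        self.entries.append(entry)      # in-order issue
        return entry

    def complete(self, entry):
        entry["done"] = True            # may happen in any order

    def commit(self):
        """Retire finished instructions from the head only."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            committed.append(self.entries.popleft()["name"])
        return committed

rob = ReorderBuffer()
div = rob.issue("div.s f2, f1, f0")     # long-latency, older
add = rob.issue("add.s f4, f5, f6")     # short-latency, younger

rob.complete(add)                       # add finishes first
first = rob.commit()                    # nothing retires: div is still pending
rob.complete(div)
second = rob.commit()                   # both retire, in program order
print(first, second)
```

Because results become architecturally visible only at commit, a mispredicted branch could be handled by discarding the younger entries before they retire. That step is left out of this sketch.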