Computer Architecture Principles Dr. Mike Frank


1 Computer Architecture Principles Dr. Mike Frank
CDA 5155 Summer 2003 Module #17 Introduction to Advanced Pipelining: Instruction-Level Parallelism

2 Advanced Pipelining Techniques: More Instruction-Level Parallelism

3 Advanced Pipelining
Chapter 4 of 2nd edition; Appendix A-8 & Chapters 3 & 4 of 3rd edition.
Focus on Instruction-Level Parallelism (ILP): executing multiple instructions (within a single program execution thread) simultaneously.
Note that even ordinary pipelining exploits some ILP (overlapping execution of multiple instructions).
Increase ILP further using multiple-issue datapaths to initiate multiple instructions at once. Such microarchitectures are called superscalar. Examples: RS/6000, PowerPC, Pentium, etc.

4 Pipeline Performance
Ideal pipeline CPI is the minimum number of cycles per instruction issued, assuming no stalls occur. May be <1 in superscalar machines: e.g., ideal CPI = 1/3 in a 3-way-issue machine (e.g., IA-64).
Real pipeline CPI = ideal pipeline CPI + structural stalls + RAW stalls + WAR stalls + WAW stalls + control stalls (average values per instruction).
Maximize performance by using various techniques to eliminate stalls and reduce the ideal CPI.
Note: real pipeline CPI still doesn't account for cache misses (we return to this in Chapter 5).
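As a quick worked check of the CPI formula above (an illustrative sketch; the stall values plugged in below are made up, not from the lecture):

```c
#include <assert.h>

/* Real pipeline CPI as the ideal CPI plus the average per-instruction
 * stall contributions named on the slide. */
double real_cpi(double ideal, double structural, double raw,
                double war, double waw, double control)
{
    return ideal + structural + raw + war + waw + control;
}
```

For a 3-way-issue machine (ideal CPI = 1/3) with, say, 0.05 structural, 0.10 RAW, and 0.15 control stalls per instruction, the real CPI comes to about 0.63: still below 1, but nearly twice the ideal.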

5 Advanced Pipelining Techniques
Technique | Reduces
Loop unrolling | Control stalls
Basic pipeline scheduling | RAW stalls
Dynamic scheduling w/ scoreboarding | RAW stalls
Dyn. sched. w/ register renaming | WAR & WAW stalls
Dynamic branch prediction | Control stalls
Issuing multiple instructions per cycle | Ideal pipeline CPI
Compiler dependence analysis | Ideal CPI & data stalls
Software pipelining & trace scheduling | Ideal CPI & data stalls
Speculation | All data & control stalls
Dynamic memory disambiguation | RAW stalls involving memory

6 Basic Blocks & ILP
A basic block is a straight-line code segment with no branches into or out of it.
Basic blocks tend to be small: 6-7 instructions on average, so the ILP available within a single basic block is limited.
We need ways to parallelize execution across multiple basic blocks!
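As a concrete illustration (hypothetical code, not from the slides), the branch points in this small C function split it into several basic blocks, marked in the comments; each block is only a few instructions long:

```c
/* Each run of straight-line code between branch points is one basic block. */
int abs_sum(const int *x, int n)
{
    int sum = 0;                  /* B1: entry block */
    for (int i = 0; i < n; i++) { /* B2: loop test (branch) */
        if (x[i] < 0)             /* B3: sign test (branch) */
            sum -= x[i];          /* B4 */
        else
            sum += x[i];          /* B5 */
    }
    return sum;                   /* B6: exit block */
}
```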

7 Loop-Level Parallelism (LLP)
Perform multiple loop iterations in parallel. Works for some loops, but not others. Examples:
for (I=1; I<=1000; I++) x[I] = x[I] + y[I]; /* iterations independent */
for (I=1; I<=1000; I++) sum = sum + x[I]; /* dependence carried through sum */
Early vector-based supercomputers (e.g., Crays) relied on this technique extensively, compiling FORTRAN loops to vector operations.
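The difference between the two loops can be sketched in C (an illustrative example, not from the slides): the first loop's iterations are independent, so they can run in any order, here reversed, with the same result, which is what lets them run in parallel; the second carries a dependence through sum from one iteration to the next.

```c
/* Independent iterations: x[i] = x[i] + y[i] touches only element i. */
void add_forward(int *x, const int *y, int n)
{
    for (int i = 0; i < n; i++)
        x[i] = x[i] + y[i];
}

/* The same iterations in reverse order produce the same x, so hardware
 * or a vector unit may execute them in parallel. */
void add_reverse(int *x, const int *y, int n)
{
    for (int i = n - 1; i >= 0; i--)
        x[i] = x[i] + y[i];
}

/* Loop-carried dependence: each iteration reads the previous value of
 * sum, so iterations cannot simply be reordered (parallelizing this
 * requires a reduction trick such as partial sums). */
int sum_array(const int *x, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum = sum + x[i];
    return sum;
}
```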

8 Converting LLP to ILP
Technique of loop unrolling: transform multiple loop iterations into a single instruction stream, without branches.
Multiple loop iterations are merged into a single basic block.
Other ILP techniques can then be used to parallelize execution within this block.
Loop unrolling can be done either statically by the compiler or dynamically by the hardware.
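A compiler-style sketch (illustrative names and unroll factor, not from the slides) of unrolling the earlier x[I] = x[I] + y[I] loop by 4, so that four formerly separate iterations become one larger branch-free basic block:

```c
/* Original loop: one add and one branch per iteration. */
void add_rolled(int *x, const int *y, int n)
{
    for (int i = 0; i < n; i++)
        x[i] = x[i] + y[i];
}

/* Unrolled by 4: the body is one basic block containing four
 * independent adds, with only one branch per four iterations.
 * A cleanup loop handles n not being a multiple of 4. */
void add_unrolled4(int *x, const int *y, int n)
{
    int i = 0;
    for (; i + 3 < n; i += 4) {
        x[i]     = x[i]     + y[i];
        x[i + 1] = x[i + 1] + y[i + 1];
        x[i + 2] = x[i + 2] + y[i + 2];
        x[i + 3] = x[i + 3] + y[i + 3];
    }
    for (; i < n; i++)  /* cleanup for leftover iterations */
        x[i] = x[i] + y[i];
}
```

The four adds inside the unrolled body have no dependences on one another, so a multiple-issue pipeline can schedule them in the same or adjacent cycles.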

