
Chapter 4: Advanced Pipelining and Instruction-Level Parallelism
Computer Architecture: A Quantitative Approach, John L. Hennessy & David A. Patterson, 2nd Edition, 1996, Morgan Kaufmann

Technique                                        Reduces
Loop unrolling                                   Control stalls
Basic pipeline scheduling                        RAW stalls
Dynamic scheduling with scoreboarding            RAW stalls
Dynamic scheduling with register renaming        WAR & WAW stalls
Dynamic branch prediction                        Control stalls
Issuing multiple instructions per cycle          Ideal CPI
Compiler dependence analysis                     Ideal CPI & data stalls
Software pipelining and trace scheduling         Ideal CPI & data stalls
Speculation                                      All data & control stalls
Dynamic memory disambiguation                    RAW stalls involving memory

Instruction-Level Parallelism

The simplest and most common way to increase the amount of parallelism available among instructions is to exploit parallelism among iterations of a loop. This type of parallelism is often called loop-level parallelism. Here is a simple example of a completely parallel loop, which adds two 1000-element arrays:

    for (i = 1; i <= 1000; i = i + 1)
        x[i] = x[i] + y[i];

Every iteration of the loop can overlap with any other iteration, although within each loop iteration there is little opportunity for overlap.
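By contrast, if each iteration used a value computed by the previous one, the iterations could not overlap. A minimal counter-example (not from the text) with such a loop-carried dependence:

    /* Loop-carried dependence: iteration i reads x[i-1], which the
       previous iteration wrote, so iterations cannot run in parallel. */
    for (i = 2; i <= 1000; i = i + 1)
        x[i] = x[i-1] + y[i];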

Two strategies to support ILP:

1. Dynamic scheduling: depend on the hardware to locate parallelism.
2. Static scheduling: rely on software (the compiler) to identify potential parallelism.

Basic Pipeline Scheduling and Loop Unrolling

    for (i = 1; i <= 1000; i++)
        x[i] = x[i] + s;

The straightforward DLX code for this loop is:

    Loop:  LD    F0,0(R1)    ; F0 = array element
           ADDD  F4,F0,F2    ; add scalar in F2
           SD    0(R1),F4    ; store result
           SUBI  R1,R1,#8    ; decrement pointer 8 bytes
           BNEZ  R1,Loop     ; branch R1 != zero

Without any scheduling, the loop takes 10 clock cycles per iteration:

    Loop:  LD    F0,0(R1)    ; 1
           stall             ; 2
           ADDD  F4,F0,F2    ; 3
           stall             ; 4
           stall             ; 5
           SD    0(R1),F4    ; 6
           SUBI  R1,R1,#8    ; 7
           stall             ; 8
           BNEZ  R1,Loop     ; 9
           stall             ; 10

We can schedule the loop to obtain only one stall:

    Loop:  LD    F0,0(R1)
           SUBI  R1,R1,#8
           ADDD  F4,F0,F2
           stall
           BNEZ  R1,Loop     ; delayed branch
           SD    8(R1),F4    ; altered & interchanged with SUBI

Execution time has been reduced from 10 clock cycles to 6.

In the above example, we complete one loop iteration and store back one array element every 6 clock cycles, but the actual work of operating on the array element takes just 3 of those 6 clock cycles (the load, add, and store). The remaining 3 clock cycles consist of loop overhead (the SUBI and BNEZ) and a stall. To eliminate these 3 clock cycles we need to get more operations within the loop relative to the number of overhead instructions. A simple scheme for increasing the number of instructions relative to the branch and overhead instructions is loop unrolling. Unrolling simply replicates the loop body multiple times, adjusting the loop termination code.
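Before unrolling at the instruction level, it may help to see what unrolling by a factor of 4 looks like at the source level. This is only an illustrative sketch (the compiler performs the transformation on intermediate code):

    /* C-level view of unrolling by 4; assumes the trip count (1000)
       is a multiple of 4, which it is. */
    for (i = 1; i <= 1000; i = i + 4) {
        x[i]   = x[i]   + s;
        x[i+1] = x[i+1] + s;
        x[i+2] = x[i+2] + s;
        x[i+3] = x[i+3] + s;
    }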

EXAMPLE: Show our loop unrolled so that there are four copies of the loop body, assuming R1 - R2 is initially a multiple of 32, which means that the number of loop iterations is a multiple of 4. Eliminate any obviously redundant computations and do not reuse any of the registers.

ANSWER: Here is the result after merging the SUBI instructions and dropping the unnecessary BNEZ operations that are duplicated during unrolling. Note that R2 must now be set so that 32(R2) is the starting address of the last four elements.

    Loop:  LD    F0,0(R1)
           ADDD  F4,F0,F2
           SD    0(R1),F4      ; drop SUBI & BNEZ
           LD    F6,-8(R1)
           ADDD  F8,F6,F2
           SD    -8(R1),F8     ; drop SUBI & BNEZ
           LD    F10,-16(R1)
           ADDD  F12,F10,F2
           SD    -16(R1),F12   ; drop SUBI & BNEZ
           LD    F14,-24(R1)
           ADDD  F16,F14,F2
           SD    -24(R1),F16
           SUBI  R1,R1,#32
           BNE   R1,R2,Loop

We have eliminated three branches and three decrements of R1. The addresses on the loads and stores have been compensated to allow the SUBI instructions on R1 to be merged. Without scheduling, every operation in the unrolled loop is followed by a dependent operation and thus will cause a stall. This loop will run in 28 clock cycles (each LD has 1 stall, each ADDD 2, the SUBI 1, the branch 1, plus 14 instruction issue cycles), or 7 clock cycles for each of the four elements. Although this unrolled version is currently slower than the scheduled version of the original loop, this will change when we schedule the unrolled loop. Loop unrolling is normally done early in the compilation process, so that redundant computations can be exposed and eliminated by the optimizer.

EXAMPLE: Show the unrolled loop in the previous example after it has been scheduled for the pipeline.

ANSWER:

    Loop:  LD    F0,0(R1)
           LD    F6,-8(R1)
           LD    F10,-16(R1)
           LD    F14,-24(R1)
           ADDD  F4,F0,F2
           ADDD  F8,F6,F2
           ADDD  F12,F10,F2
           ADDD  F16,F14,F2
           SD    0(R1),F4
           SD    -8(R1),F8
           SUBI  R1,R1,#32
           SD    16(R1),F12    ; 16 - 32 = -16
           BNE   R1,R2,Loop
           SD    8(R1),F16     ; 8 - 32 = -24

The execution time of the unrolled loop has dropped to a total of 14 clock cycles, or 3.5 clock cycles per element, compared with 7 cycles per element before scheduling and 6 cycles when scheduled but not unrolled.

Dynamic Scheduling with Scoreboarding

Scoreboarding is a technique for allowing instructions to execute out of order when there are sufficient resources and no data dependences. The four steps, which replace the ID, EX, and WB steps in the standard DLX pipeline, are as follows:

1. Issue – If a functional unit for the instruction is free and no other active instruction has the same destination register, the scoreboard issues the instruction to the functional unit and updates its internal data structure.
2. Read operands – The scoreboard monitors the availability of the source operands. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution.
3. Execution – The functional unit begins execution upon receiving operands, and notifies the scoreboard when the result is ready.
4. Write result – Once the scoreboard is aware that the functional unit has completed execution, it checks for WAR hazards and, when it is safe, tells the functional unit to write its result to the destination register.

The Basic Structure of a DLX processor with a scoreboard

There are three parts to the scoreboard:

1. Instruction status – Indicates which of the four steps the instruction is in.
2. Functional unit status – Indicates the state of the functional unit (FU). There are nine fields for each functional unit:
   Busy – Indicates whether the unit is busy or not.
   Op – Operation to perform in the unit.
   Fi – Destination register.
   Fj, Fk – Source-register numbers.
   Qj, Qk – Functional units producing source registers Fj, Fk.
   Rj, Rk – Flags indicating when Fj, Fk are ready.
3. Register result status – Indicates which functional unit will write each register, if an active instruction has the register as its destination.
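The bookkeeping above maps naturally onto simple data structures. Here is a minimal C sketch, not from the book: the field names follow the list above, the sizes and the use of -1 for "no functional unit" are assumptions of this sketch, and can_issue implements the issue test from step 1.

    #define NUM_FU   5      /* number of functional units (illustrative) */
    #define NUM_REGS 32     /* number of FP registers (illustrative)     */

    enum step { ISSUE, READ_OPERANDS, EXECUTE, WRITE_RESULT };

    struct fu_status {            /* 2. Functional unit status               */
        int busy;                 /* Busy: unit in use?                      */
        int op;                   /* Op: operation to perform                */
        int fi;                   /* Fi: destination register                */
        int fj, fk;               /* Fj, Fk: source-register numbers         */
        int qj, qk;               /* Qj, Qk: FU producing Fj/Fk, -1 if none  */
        int rj, rk;               /* Rj, Rk: flags, is each operand ready?   */
    };

    struct scoreboard {
        enum step instr_status[64];    /* 1. Instruction status, per active  */
                                       /*    instruction (size illustrative) */
        struct fu_status fu[NUM_FU];   /* 2. Functional unit status          */
        int result_status[NUM_REGS];   /* 3. Register result status: FU that */
                                       /*    will write each reg, -1 if none */
    };

    /* Issue test from step 1: the unit must be free, and no active
       instruction may have the same destination register (avoids WAW). */
    int can_issue(const struct scoreboard *sb, int unit, int dest) {
        return !sb->fu[unit].busy && sb->result_status[dest] == -1;
    }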

Components of the Scoreboard

EXAMPLE: Assume the following EX cycle latencies (chosen to illustrate the behavior and not representative) for the floating-point functional units: add is 2 clock cycles, multiply is 10 clock cycles, and divide is 40 clock cycles.

SOLUTION: [scoreboard status tables, shown as a figure in the original slides]

Dynamic Branch Prediction

The simplest dynamic branch-prediction scheme is a branch-prediction buffer or branch history table. A branch-prediction buffer is a small memory indexed by the lower portion of the address of the branch instruction. The memory contains a bit that says whether the branch was recently taken or not. This scheme is the simplest sort of buffer; it has no tags and is useful only to reduce the branch delay when it is longer than the time to compute the possible target PCs. A drawback of this one-bit scheme is that a loop branch that is almost always taken is mispredicted twice: once on the loop's last iteration and again on the first iteration of its next execution. Two-bit schemes remedy this by requiring a prediction to miss twice before it is changed.

The two-bit scheme is actually a specialization of a more general scheme that has an n-bit saturating counter for each entry in the prediction buffer. With an n-bit counter, the counter can take on values between 0 and 2^n - 1: when the counter is greater than or equal to one half of its maximum value (2^(n-1)), the branch is predicted as taken; otherwise, it is predicted untaken. As in the two-bit scheme, the counter is incremented on a taken branch and decremented on an untaken branch. Studies of n-bit predictors have shown that the two-bit predictors do almost as well, and thus most systems rely on two-bit branch predictors rather than the more general n-bit predictors.
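A minimal C sketch of such a saturating-counter prediction buffer follows. The table size, the PC indexing, and all names are illustrative assumptions, not from the text; with n = 2 it reproduces the two-bit scheme.

    #include <stdint.h>

    #define PRED_BITS  2                          /* n: counter width          */
    #define PRED_MAX   ((1u << PRED_BITS) - 1u)   /* 2^n - 1                   */
    #define TABLE_SIZE 1024                       /* prediction-buffer entries */

    static uint8_t counter[TABLE_SIZE];           /* one saturating counter    */
                                                  /* per entry, initially 0    */

    /* Index with the lower bits of the branch address (4-byte instructions). */
    static unsigned pred_index(uint32_t pc) { return (pc >> 2) % TABLE_SIZE; }

    /* Predict taken when the counter is at least half its maximum, 2^(n-1). */
    int predict_taken(uint32_t pc) {
        return counter[pred_index(pc)] >= (1u << (PRED_BITS - 1));
    }

    /* Increment on a taken branch, decrement on an untaken one, saturating
       at 0 and 2^n - 1. */
    void predictor_update(uint32_t pc, int taken) {
        uint8_t *c = &counter[pred_index(pc)];
        if (taken  && *c < PRED_MAX) (*c)++;
        if (!taken && *c > 0)        (*c)--;
    }

Setting PRED_BITS to 1 yields the simple one-bit buffer described earlier.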