Instruction-Level Parallel Processors {Objective: executing two or more instructions in parallel} 4.1 Evolution and overview of ILP-processors 4.2 Dependencies.

Slides:

Advertisements

Similar presentations

Computer architecture

Advertisements

Computer Architecture Instruction-Level Parallel Processors

ILP: IntroductionCSCE430/830 Instruction-level parallelism: Introduction CSCE430/830 Computer Architecture Lecturer: Prof. Hong Jiang Courtesy of Yifeng.

CSC 370 (Blum)1 Instruction-Level Parallelism. CSC 370 (Blum)2 Instruction-Level Parallelism Instruction-level Parallelism (ILP) is when a processor has.

Instruction-Level Parallelism compiler techniques and branch prediction prepared and Instructed by Shmuel Wimer Eng. Faculty, Bar-Ilan University March.

Compiler techniques for exposing ILP

1 Lecture 5: Static ILP Basics Topics: loop unrolling, VLIW (Sections 2.1 – 2.2)

Superscalar processors Review. Dependence graph S1S2 Nodes: instructions Edges: ordered relations among the instructions Any ordering-based transformation.

CPE 631: ILP, Static Exploitation Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar Milenkovic,

FTC.W99 1 Advanced Pipelining and Instruction Level Parallelism (ILP) ILP: Overlap execution of unrelated instructions gcc 17% control transfer –5 instructions.

COMP4611 Tutorial 6 Instruction Level Parallelism

POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? ILP: VLIW Architectures Marco D. Santambrogio:

Optimizing single thread performance Dependence Loop transformations.

Data Dependence Types and Associated Pipeline Hazards Chapter 4 — The Processor — 1 Sections 4.7.

1 Lecture: Static ILP Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2)

Anshul Kumar, CSE IITD CS718 : VLIW - Software Driven ILP Introduction 23rd Mar, 2006.

Eliminating Stalls Using Compiler Support. Instruction Level Parallelism gcc 17% control transfer –5 instructions + 1 branch –Reordering among 5 instructions.

EECC551 - Shaaban #1 Fall 2003 lec# Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining increases performance by overlapping.

COMP25212 Advanced Pipelining Out of Order Processors.

Dynamic Branch PredictionCS510 Computer ArchitecturesLecture Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining.

1 Advanced Computer Architecture Limits to ILP Lecture 3.

1 Lecture 10: Static ILP Basics Topics: loop unrolling, static branch prediction, VLIW (Sections 4.1 – 4.4)

Datorteknik F1 bild 1 Instruction Level Parallelism Scalar-processors –the model so far SuperScalar –multiple execution units in parallel VLIW –multiple.

Limits on ILP. Achieving Parallelism Techniques – Scoreboarding / Tomasulo’s Algorithm – Pipelining – Speculation – Branch Prediction But how much more.

Chapter 3 Instruction-Level Parallelism and Its Dynamic Exploitation – Concepts 吳俊興高雄大學資訊工程學系 October 2004 EEF011 Computer Architecture 計算機結構.

Compiler Challenges, Introduction to Data Dependences Allen and Kennedy, Chapter 1, 2.

Instruction Level Parallelism (ILP) Colin Stevens.

EECC551 - Shaaban #1 Winter 2002 lec# Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining increases performance by overlapping.

EECC551 - Shaaban #1 Spring 2006 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.

EECC551 - Shaaban #1 Fall 2005 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.

1 Lecture 5: Pipeline Wrap-up, Static ILP Basics Topics: loop unrolling, VLIW (Sections 2.1 – 2.2) Assignment 1 due at the start of class on Thursday.

Chapter 2 Instruction-Level Parallelism and Its Exploitation

EECC551 - Shaaban #1 Winter 2011 lec# Pipelining and Instruction-Level Parallelism (ILP). Definition of basic instruction block Increasing Instruction-Level.

EECC551 - Shaaban #1 Spring 2004 lec# Definition of basic instruction blocks Increasing Instruction-Level Parallelism & Size of Basic Blocks.

Computer Architecture 2010 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

Data Dependencies A dependency type that can cause a stall.

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Instruction-Level Parallelism for Low-Power Embedded Processors January 23, 2001 Presented By Anup Gangwar.

RISC architecture and instruction Level Parallelism (ILP) based on “Computer Architecture: a Quantitative Approach” by Hennessy and Patterson, Morgan Kaufmann.

Pipelining and Parallelism Mark Staveley

Processor Level Parallelism. Improving the Pipeline Pipelined processor – Ideal speedup = num stages – Branches / conflicts mean limited returns after.

3/12/2013Computer Engg, IIT(BHU)1 CONCEPTS-1. Pipelining Pipelining is used to increase the speed of processing It uses temporal parallelism In pipelining,

Lecture 1: Introduction Instruction Level Parallelism & Processor Architectures.

Out-of-order execution Lihu Rappoport 11/ MAMAS – Computer Architecture Out-Of-Order Execution Dr. Lihu Rappoport.

Lecture 38: Compiling for Modern Architectures 03 May 02

CS 352H: Computer Systems Architecture

Dynamic Scheduling Why go out of style?

Computer Architecture Principles Dr. Mike Frank

Concepts and Challenges

William Stallings Computer Organization and Architecture 8th Edition

CS203 – Advanced Computer Architecture

Instructional Parallelism

Instruction Scheduling for Instruction-Level Parallelism

Lecture: Static ILP Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2)

Lecture: Static ILP Topics: loop unrolling, software pipelines (Sections C.5, 3.2) HW3 posted, due in a week.

How to improve (decrease) CPI

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Instruction Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

CSC3050 – Computer Architecture

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Appendix C Practice Problem Set 1

How to improve (decrease) CPI

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Lecture 5: Pipeline Wrap-up, Static ILP

Instruction Level Parallelism

Presentation transcript:

Instruction-Level Parallel Processors {Objective: executing two or more instructions in parallel} 4.1 Evolution and overview of ILP-processors 4.2 Dependencies between instructions 4.3 Instruction scheduling 4.4 Preserving sequential consistency 4.5 The speed-up potential of ILP-processing CH04 TECH Computer Science

Improve CPU performance by increasing clock rates  (CPU running at 2GHz!) increasing the number of instructions to be executed in parallel  (executing 6 instructions at the same time)

How do we increase the number of instructions to be executed?

Time and Space parallelism

Pipeline (assembly line)

Result of pipeline (e.g.)

VLIW (very long instruction word,1024 bits!)

Superscalar (sequential stream of instructions)

From Sequential instructions to parallel execution Dependencies between instructions Instruction scheduling Preserving sequential consistency

4.2 Dependencies between instructions  Instructions often depend on each other in such a way that a particular instruction cannot be executed until a preceding instruction or even two or three preceding instructions have been executed. 1 Data dependencies 2 Control dependencies 3 Resource dependencies

4.2.1 Data dependencies Read after Write Write after Read Write after Write Recurrences

Data dependencies in straight-line code (RAW) RAW dependencies  i1: load r1, a  r2: add r2, r1, r1 flow dependencies true dependencies cannot be abandoned

Data dependencies in straight-line code (WAR) WAR dependencies  i1: mul r1, r2, r3  r2: add r2, r4, r5 anti-dependencies false dependencies can be eliminated through register renaming  i1: mul r1, r2, r3  r2: add r6, r4, r5  by using compiler or ILP-processor

Data dependencies in straight-line code (WAW) WAW dependencies  i1: mul r1, r2, r3  r2: add r1, r4, r5 output dependencies false dependencies can be eliminated through register renaming  i1: mul r1, r2, r3  r2: add r6, r4, r5  by using compiler or ILP-processor

Data dependencies in loops for (int i=2; i<10; i++) { x[i] = a*x[i-1] + b } cannot be executed in parallel

Data dependency graphs i1: load r1, a i2: load r2, b i3: load r3, r1, r2 i4: mul r1, r2, r4; i5: div r1, r2, r4

4.2.2 Control dependencies mul r1, r2, r3 jz zproc : zproc: load r1, x : actual path of execution depends on the outcome of multiplication impose dependencies on the logical subsequent instructions

Control Dependency Graph

Branches? Frequency and branch distance Expected frequency of (all) branch  general-purpose programs (non scientific): 20-30%  scientific programs: 5-10% Expected frequency of conditional branch  general-purpose programs: 20%  scientific programs: 5-10% Expected branch distance (between two branch)  general-purpose programs: every 3 rd -5 th instruction, on average, to be a conditional branch  scientific programs: every 10 th -20 th instruction, on average, to be a conditional branch

Impact of Branch on instruction issue fig. 4.14

4.2.3 Resource dependencies An instruction is resource-dependent on a previously issued instruction if it requires a hardware resource which is still being used by a previously issued instruction. e.g.  div r1, r2, r3  div r4, r2, r5

4.3 Instruction scheduling scheduling or arranging two or more instruction to be executed in parallel  Need to detect code dependency (detection)  Need to remove false dependency (resolution) a means to extract parallelism  instruction-level parallelism which is implicit, is made explicit Two basic approaches  Static: done by compiler  Dynamic: done by processor

Instruction Scheduling:

ILP-instruction scheduling

4.4 Preserving sequential consistency care must be taken to maintain the logical integrity of the program execution parallel execution mimics sequential execution as far as the logical integrity of program execution is concerned e.g.  add r5, r6, r7  div r1, r2, r3  jz somewhere

Concept sequential consistency

4.5 The speed-up potential of ILP-processing Parallel instruction execution may be restricted by data, control and resource dependencies. Potential speed-up when parallel instruction execution is restricted by true data and control dependencies:  general purpose programs: about 2  scientific programs: about 2-4 Why are the speed-up figures so low?  basic block (a low-efficiency method used to extract parallelism)

Basic Block is a straight-line code sequence that can only be entered at the beginning and left at its end. i1: calc: add r3, r1, r2 i2: sub r4, r1, r2 i3: mul r4, r3, r4 i4: jn negproc Basic block lengths of , with an overall average of 4.9 (RISC: general 7.8 and scientific 31.6 ) Conditional Branch  control dependencies

Two other methods for speed up Potential speed-up embodied in loops  amount to a figure of 50 (on average speed up)  unlimited resources (about 50 processors, and about 400 registers)  ideal schedule appropriate handling of control dependencies  amount to 10 to 100 times speed up  assuming perfect oracles that always pick the right path for conditional branch  control dependencies are the real obstacle in utilizing instruction-level parallelism!

What do we do without a perfect oracle? execute all possible paths in conditional branch  there are 2^N paths for N conditional branches  pursuing an exponential increasing number of paths would be an unrealistic approach. Make your Best Guess  branch prediction  pursuing both possible paths but restrict the number of subsequent conditional branches  (more more CH. 8)

How close real systems can come to the upper limits of speed-up? ambitious processor can expect to achieve speed-up figures of about  4 for general purpose programs  for scientific programs an ambitious processor:  predicts conditional branches  has 256 integer and 256 FP register  eliminates false data dependencies through register renaming,  performs a perfect memory disambiguation  maintains a gliding instruction window of 64 items