1 ILP (Recap). 2 Basic Block (BB) ILP is quite small –BB: a straight-line code sequence with no branches in except to the entry and no branches out except.

Slides:

Advertisements

Similar presentations

Instruction-Level Parallelism

Advertisements

Software Exploits for ILP We have already looked at compiler scheduling to support ILP – Altering code to reduce stalls – Loop unrolling and scheduling.

INSTRUCTION-LEVEL PARALLEL PROCESSORS

CS 378 Programming for Performance Single-Thread Performance: Compiler Scheduling for Pipelines Adopted from Siddhartha Chatterjee Spring 2009.

CSC 370 (Blum)1 Instruction-Level Parallelism. CSC 370 (Blum)2 Instruction-Level Parallelism Instruction-level Parallelism (ILP) is when a processor has.

CPE 731 Advanced Computer Architecture Instruction Level Parallelism Part I Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.

Instruction-Level Parallelism compiler techniques and branch prediction prepared and Instructed by Shmuel Wimer Eng. Faculty, Bar-Ilan University March.

HW 2 is out! Due 9/25!. CS 6290 Static Exploitation of ILP.

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Advanced Computer Architecture COE 501.

ENGS 116 Lecture 111 ILP: Software Approaches 2 Vincent H. Berk October 14 th Reading for monday: 3.10 – 3.15, Reading for today: 4.2 – 4.6.

Compiler techniques for exposing ILP

1 4/20/06 Exploiting Instruction-Level Parallelism with Software Approaches Original by Prof. David A. Patterson.

Instruction Level Parallelism María Jesús Garzarán University of Illinois at Urbana-Champaign.

COMP4611 Tutorial 6 Instruction Level Parallelism

SIMD Single instruction on multiple data – This form of parallel processing has existed since the 1960s – The idea is rather than executing array operations.

POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? ILP: VLIW Architectures Marco D. Santambrogio:

EECC551 - Shaaban #1 Fall 2003 lec# Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining increases performance by overlapping.

1 COMP 740: Computer Architecture and Implementation Montek Singh Tue, Feb 24, 2009 Topic: Instruction-Level Parallelism IV (Software Approaches/Compiler.

Computer Architecture Instruction Level Parallelism Dr. Esam Al-Qaralleh.

Rung-Bin Lin Chapter 4: Exploiting Instruction-Level Parallelism with Software Approaches4-1 Chapter 4 Exploiting Instruction-Level Parallelism with Software.

Dynamic Branch PredictionCS510 Computer ArchitecturesLecture Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining.

Chapter 4 Advanced Pipelining and Intruction-Level Parallelism Computer Architecture A Quantitative Approach John L Hennessy & David A Patterson 2 nd Edition,

Lecture 3: Chapter 2 Instruction Level Parallelism Dr. Eng. Amr T. Abdel-Hamid CSEN 601 Spring 2011 Computer Architecture Text book slides: Computer Architec.

1 Lecture 10: Static ILP Basics Topics: loop unrolling, static branch prediction, VLIW (Sections 4.1 – 4.4)

Bernstein’s Conditions. Techniques to Exploit Parallelism in Sequential Programming Hierarchy of levels of parallelism: Procedure or Methods Statements.

EECC551 - Shaaban #1 Fall 2005 lec# Static Compiler Optimization Techniques We examined the following static ISA/compiler techniques aimed.

COMP4211 Seminar Intro to Instruction-Level Parallelism 04S1 Week 02 Oliver Diessel.

1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 3 (and Appendix C) Instruction-Level Parallelism and Its Exploitation Computer Architecture.

Chapter 3 Instruction-Level Parallelism and Its Dynamic Exploitation – Concepts 吳俊興高雄大學資訊工程學系 October 2004 EEF011 Computer Architecture 計算機結構.

Instruction Level Parallelism (ILP) Colin Stevens.

EENG449b/Savvides Lec /22/05 March 22, 2005 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG Computer.

EECC551 - Shaaban #1 Winter 2002 lec# Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining increases performance by overlapping.

EENG449b/Savvides Lec /17/04 February 17, 2004 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG.

EECC551 - Shaaban #1 Spring 2006 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.

EECC551 - Shaaban #1 Fall 2005 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.

EENG449b/Savvides Lec /20/04 February 12, 2004 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG.

EECC551 - Shaaban #1 Fall 2002 lec# Floating Point/Multicycle Pipelining in MIPS Completion of MIPS EX stage floating point arithmetic operations.

Computer Architecture Instruction Level Parallelism Dr. Esam Al-Qaralleh.

EECC551 - Shaaban #1 Winter 2011 lec# Pipelining and Instruction-Level Parallelism (ILP). Definition of basic instruction block Increasing Instruction-Level.

EECC551 - Shaaban #1 Spring 2004 lec# Definition of basic instruction blocks Increasing Instruction-Level Parallelism & Size of Basic Blocks.

COMP381 by M. Hamdi 1 Loop Level Parallelism Instruction Level Parallelism: Loop Level Parallelism.

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Predicated Static Single Assignment (PSSA) Presented by AbdulAziz Al-Shammari

CS 211: Computer Architecture Lecture 6 Module 2 Exploiting Instruction Level Parallelism with Software Approaches Instructor: Morris Lancaster.

Recap Multicycle Operations –MIPS Floating Point Putting It All Together: the MIPS R4000 Pipeline.

CS412/413 Introduction to Compilers Radu Rugina Lecture 18: Control Flow Graphs 29 Feb 02.

1 Control Flow Graphs. 2 Optimizations Code transformations to improve program –Mainly: improve execution time –Also: reduce program size Can be done.

CS203 – Advanced Computer Architecture

Computer Architecture Principles Dr. Mike Frank

Concepts and Challenges

Dynamic Branch Prediction

CS 704 Advanced Computer Architecture

Instruction-Level Parallelism (ILP)

CSL718 : VLIW - Software Driven ILP

Copyright © 2011, Elsevier Inc. All rights Reserved.

Coe818 Advanced Computer Architecture

/ Computer Architecture and Design

Adapted from the slides of Prof

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Instruction Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Structured Programming

Dynamic Hardware Prediction

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

How to improve (decrease) CPI

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

CMSC 611: Advanced Computer Architecture

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Presentation transcript:

1 ILP (Recap)

2 Basic Block (BB) ILP is quite small –BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit –average dynamic branch frequency 15% to 25% => 4 to 7 instructions execute between a pair of branches –Plus instructions in BB likely to depend on each other To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks Simplest: loop-level parallelism to exploit parallelism among iterations of a loop Instruction Level Parallelism

3 Loop Unrolling Example: Key to increasing ILP For the loop: for (i=1; i<=1000; i++) x(i) = x(i) + s; ; The straightforward MIPS assembly code is given by: Loop: L.DF0, 0 (R1) ADD.DF4, F0, F2 S.D F4, 0(R1) SUBI R1, R1, # 8 BNE R1,Loop InstructionInstruction Latency in producing resultusing result clock cycles FP ALU opAnother FP ALU op3 FP ALU opStore double2 Load doubleFP ALU op1 Load doubleStore double0 Integer opInteger op0

4 Loop Showing Stalls and Code Re-arrangement 1 Loop:LD F0,0(R1) 2stall 3ADDDF4,F0,F2 4stall 5stall 6 SD0(R1),F4 7 SUBIR1,R1,8 8 BNEZR1,Loop 9 stall 9 clock cycles per loop iteration. 1Loop:LDF0,0(R1) 2Stall 3ADDDF4,F0,F2 4SUBIR1,R1,8 5BNEZR1,Loop 6SD8(R1),F4 Code now takes 6 clock cycles per loop iteration Speedup = 9/6 = 1.5 The number of cycles cannot be reduced further because: The body of the loop is small The loop overhead (SUBI R1, R1, 8 and BNEZ R1, Loop)

5 Basic Loop Unrolling Concept: 4n iterations n iterations 4 iterations

6 Unroll Loop Four Times to expose more ILP and reduce loop overhead 1 Loop:LDF0,0(R1) 2ADDDF4,F0,F2 3SD0(R1),F4 ;drop SUBI & BNEZ 4LDF6,-8(R1) 5ADDDF8,F6,F2 6SD-8(R1),F8 ;drop SUBI & BNEZ 7LDF10,-16(R1) 8ADDDF12,F10,F2 9SD-16(R1),F12 ;drop SUBI & BNEZ 10LDF14,-24(R1) 11ADDDF16,F14,F2 12SD-24(R1),F16 13SUBIR1,R1,#32 14BNEZR1,LOOP 15stall x (2 + 1)= 27 clock cycles, or 6.8 cycles per iteration (2 stalls after each ADDD and 1 stall after each LD) 1 Loop:LDF0,0(R1) 2LDF6,-8(R1) 3LDF10,-16(R1) 4LDF14,-24(R1) 5ADDDF4,F0,F2 6ADDDF8,F6,F2 7ADDDF12,F10,F2 8ADDDF16,F14,F2 9SD0(R1),F4 10SD-8(R1),F8 11SD-16(R1),F8 12SUBIR1,R1,#32 13BNEZR1,LOOP 14SD8(R1),F16 14 clock cycles or 3.5 clock cycles per iteration The compiler (or Hardware) must be able to:   Determine data dependency   Do code re-arrangement   Register renaming

7 Loop-Level Parallelism (LLP) Analysis Loop-Level Parallelism (LLP) analysis focuses on whether data accesses in later iterations of a loop are data dependent on data values produced in earlier iterations. e.g. in for (i=1; i<=1000; i++) x[i] = x[i] + s; the computation in each iteration is independent of the previous iterations and the loop is thus parallel. The use of X[i] twice is within a single iteration.  Thus loop iterations are parallel (or independent from each other). Loop-carried Dependence: A data dependence between different loop iterations (data produced in earlier iteration used in a later one) – limits parallelism. Instruction level parallelism (ILP) analysis, on the other hand, is usually done when instructions are generated by the compiler.

8 LLP Analysis Example 1 In the loop: for (i=1; i<=100; i=i+1) { A[i+1] = A[i] + C[i]; /* S1 */ B[i+1] = B[i] + A[i+1];} /* S2 */ } (Where A, B, C are distinct non-overlapping arrays) –S2 uses the value A[i+1], computed by S1 in the same iteration. This data dependence is within the same iteration (not a loop-carried dependence).  does not prevent loop iteration parallelism. –S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1] read in iteration i+1 (loop-carried dependence, prevents parallelism). The same applies for S2 for B[i] and B[i+1]  These two dependences are loop-carried spanning more than one iteration preventing loop parallelism.

9 LLP Analysis Example 2 In the loop: for (i=1; i<=100; i=i+1) { A[i] = A[i] + B[i]; /* S1 */ B[i+1] = C[i] + D[i]; /* S2 */ } –S1 uses the value B[i] computed by S2 in the previous iteration (loop-carried dependence) –This dependence is not circular: S1 depends on S2 but S2 does not depend on S1. –Can be made parallel by replacing the code with the following: A[1] = A[1] + B[1]; for (i=1; i<=99; i=i+1) { B[i+1] = C[i] + D[i]; A[i+1] = A[i+1] + B[i+1]; } B[101] = C[100] + D[100]; Loop Start-up code Loop Completion code

10 LLP Analysis Example 2 Original Loop: A[100] = A[100] + B[100]; B[101] = C[100] + D[100]; A[1] = A[1] + B[1]; B[2] = C[1] + D[1]; A[2] = A[2] + B[2]; B[3] = C[2] + D[2]; A[99] = A[99] + B[99]; B[100] = C[99] + D[99]; A[100] = A[100] + B[100]; B[101] = C[100] + D[100]; A[1] = A[1] + B[1]; B[2] = C[1] + D[1]; A[2] = A[2] + B[2]; B[3] = C[2] + D[2]; A[99] = A[99] + B[99]; B[100] = C[99] + D[99]; for (i=1; i<=100; i=i+1) { A[i] = A[i] + B[i]; /* S1 */ B[i+1] = C[i] + D[i]; /* S2 */ } A[1] = A[1] + B[1]; for (i=1; i<=99; i=i+1) { B[i+1] = C[i] + D[i]; A[i+1] = A[i+1] + B[i+1]; } B[101] = C[100] + D[100]; Modified Parallel Loop: Iteration 1 Iteration 2 Iteration 100Iteration 99 Loop-carried Dependence Loop Start-up code Loop Completion code Iteration 1 Iteration 98 Iteration 99 Not Loop Carried Dependence.....