CS203 – Advanced Computer Architecture: Instruction Level Parallelism


Instruction-Level Parallelism (ILP): overlap the execution of instructions to improve performance.
Two approaches to exploit ILP: dynamic and static.
Dynamic: rely on hardware to discover and exploit the parallelism at run time (e.g., Pentium 4, AMD Opteron, IBM Power); out-of-order execution, superscalar architectures.
Static: rely on software technology to find the parallelism statically at compile time (e.g., Itanium 2).

Instruction-Level Parallelism (ILP)
The ILP within a basic block (BB) is quite small.
BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit.
An average dynamic branch frequency of 15% to 25% means only 4 to 7 instructions execute between a pair of branches, and the instructions in a BB are likely to depend on each other.
To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks, which implies predicting branches.
The simplest form of ILP is loop-level parallelism: exploit parallelism among iterations of a loop, e.g.
for (i=1; i<=1000; i=i+1)
  x[i] = x[i] + y[i];

Loop-Level Parallelism
Exploit loop-level parallelism by "unrolling" the loop, either:
dynamically, via branch prediction and/or dataflow microarchitectures, or
statically, via loop unrolling by the compiler.
Another way is vectors, to be covered later.
Determining instruction dependence is critical to loop-level parallelism.
If two instructions are parallel, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls (assuming no structural hazards).
If they are dependent, they are not parallel and must be executed in order, although they may often be partially overlapped.

ILP and Data Dependences, Hazards
HW/SW must preserve program order: the order in which instructions would execute if run sequentially, as determined by the original source program.
Dependences are a property of programs.
Opportunities for dynamic ILP are constrained by: dynamic dependence resolution, the resources available, and the ability to predict branches.
Data dependences, as seen before: RAW, WAR, WAW (illustrated in the sketch below).
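A minimal C sketch (not from the original slides) of the three data-dependence classes on a single variable a:

/* Illustrative only: the three data-dependence (hazard) classes. */
int dependences(int b, int c, int e, int f, int g) {
    int a, d;
    a = b + c;    /* S1: writes a                                     */
    d = a + e;    /* S2: reads a  -> RAW (true dependence) on S1      */
    a = f + g;    /* S3: writes a -> WAR (anti-dependence) w.r.t. S2, */
                  /*     WAW (output dependence) w.r.t. S1            */
    return a + d;
}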

Structure of Compilers

Compiler Optimizations
Goals: (1) correctness, (2) speed.
Optimization levels:
High level (at source code level): procedure integration.
Local (within a basic block): common subexpression elimination (CSE), constant propagation, stack height reduction.
Global (across basic blocks): global CSE, copy propagation, loop-invariant code motion, induction variable elimination.
Machine dependent: strength reduction, pipeline scheduling, load delay slot filling.
Impact of optimizations: dramatic impact on FP code; most of the reduction is in integer and load/store operations; some reduction in branches. (A hand-worked example follows below.)
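A small hand-worked C example (hypothetical, not from the slides) showing constant propagation, common-subexpression elimination, and strength reduction:

/* Before: k is a known constant, the subscript i*k is computed twice,
   and the sum is multiplied by a power of two. */
int before(int i, const int x[], const int y[]) {
    int k = 4;
    return (x[i * k] + y[i * k]) * 2;
}

/* After (what the optimizer effectively produces): k propagated,
   the common subexpression i*4 computed once, and both
   multiplications strength-reduced to shifts. */
int after(int i, const int x[], const int y[]) {
    int t = i << 2;               /* i * 4 */
    return (x[t] + y[t]) << 1;    /* ( ... ) * 2 */
}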

Effects of Compiler Optimizations

Compilers and ISA: how ISA design can help compilers
Regularity: keep data types and addressing modes orthogonal whenever reasonable.
Provide primitives, not solutions: do not attempt to match high-level language constructs in the ISA.
Simplify trade-offs among alternatives: make it easier to select the best code sequence.
Allow the binding of values known at compile time.

Example: Expression Execution
Sum = a + b + c + d;
The semantics are Sum = (((a + b) + c) + d);
Parallel (associative) execution: Sum = (a + b) + (c + d);

Sequential execution (start / finish cycle):
  Add fs, fa, fb    0   3
  Add fs, fs, fc    3   6
  Add fs, fs, fd    6   9   => 9 cycles

Parallel (associative) execution:
  Add f1, fa, fb    0   3
  Add f2, fc, fd    1   4
  Add fs, f1, f2    4   7   => 7 cycles

Compiler Support for ILP
Dependence Analysis
Loop Unrolling
Software Pipelining

ILP in Loops
Example 1:
for (j=1; j<=N; j++)
  X[j] = X[j] + a;
Dependence within the same iteration on X[j]. Loop-carried dependence on j, but j is an induction variable.
Example 2:
for (j=1; j<=N; j++) {
  A[j+1] = A[j] + C[j];     /* S1 */
  B[j+1] = B[j] + A[j+1];   /* S2 */
}
Loop-carried dependence on S1 (A[j+1] written in one iteration is read as A[j] in the next). Data-flow dependence from S1 to S2 within an iteration.

ILP in Loops (2)
Example 3:
for (j=1; j<=N; j++) {
  A[j] = A[j] + B[j];       /* S1 */
  B[j+1] = C[j] + D[j];     /* S2 */
}
Loop-carried dependence from S2 to S1, but no circular dependences, so the loop can be transformed.
Parallel version:
A[1] = A[1] + B[1];
for (j=1; j<=N-1; j++) {
  B[j+1] = C[j] + D[j];
  A[j+1] = A[j+1] + B[j+1];
}
B[N+1] = C[N] + D[N];

Dependence Analysis
Dependence detection algorithms determine whether two references to an array element (one write and one read) can refer to the same location.
Assume all array indices are affine: X[a*i + b] and X[c*i + d].
A dependence can exist iff the two affine functions take the same value for some indices within the loop bounds: for l <= j, k <= u, if a*j + b = c*k + d then there is a dependence.
But a, b, c, d may not be known at compile time.
GCD test: a necessary condition for the existence of a dependence; if a dependence exists, then GCD(a, c) must divide (d - b). If the test fails, no dependence is possible (sketched in C below).
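A sketch of the GCD test in C (hypothetical helper names; assumes the coefficients a and c are nonzero). It can only rule a dependence out, never confirm one:

#include <stdlib.h>

/* Euclid's algorithm. */
static int gcd(int x, int y) {
    while (y != 0) { int t = x % y; x = y; y = t; }
    return abs(x);
}

/* Write to X[a*i + b] and read of X[c*i + d] in the same loop.
   If gcd(a, c) does not divide (d - b), no dependence is possible;
   otherwise a dependence MAY exist (necessary, not sufficient). */
int gcd_test_may_depend(int a, int b, int c, int d) {
    return (d - b) % gcd(a, c) == 0;
}

For the example on the next slide, gcd_test_may_depend(6, 0, 4, 1) returns 0: GCD(6, 4) = 2 does not divide 1, so the two references cannot conflict.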

Dependence Analysis (2)
GCD example, with references X[a*i + b] and X[c*i + d]:
for (int i = 0; i < 100; i++) {
  A[6*i] = B[i];        /* S1 */
  C[i] = A[4*i + 1];    /* S2 */
}
GCD test: a = 6, b = 0, c = 4, d = 1; GCD(a, c) = 2 and (d - b) = 1. Since 2 does not divide 1, there can be no dependence between the accesses to array A.

Dependence Analysis (3)
In the general case, determining whether a dependence exists is NP-complete. Exact tests exist for limited cases; a hierarchy of tests of increasing generality and cost is used in practice.
Drawback of dependence analysis: it applies only to references within single loop nests with affine index functions.
Where it fails (see the fragments below):
pointer references as opposed to array indices;
indirect indexing (e.g., A[B[i]]) used in sparse-matrix computations;
when a dependence can potentially exist statically but never occurs in practice (at run time);
when the optimization depends on knowing which write of a variable a read depends on.
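Two illustrative C fragments (not from the slides) where the affine analysis above cannot prove independence:

/* Indirect indexing (sparse-matrix style): which elements of A are
   written depends on the run-time contents of B, so the compiler
   cannot prove that different iterations touch different locations. */
void scatter_add(double A[], const int B[], const double V[], int n) {
    for (int i = 0; i < n; i++)
        A[B[i]] += V[i];
}

/* Pointer references: p and q may alias, so a store through p in one
   iteration might feed a load through q in a later iteration. */
void copy_add(double *p, double *q, int n) {
    for (int i = 0; i < n; i++)
        p[i] = q[i] + 1.0;
}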

Software Techniques – Example
This code adds a scalar to a vector:
for (i=1000; i>0; i=i-1)
  x[i] = x[i] + s;
Assume the following latencies for all examples, and ignore the delayed branch in these examples.

Instruction producing result   Instruction using result   Latency (cycles)   Stalls (cycles)
FP ALU op                      Another FP ALU op          4                  3
FP ALU op                      Store double               3                  2
Load double                    FP ALU op                  1                  1
Load double                    Store double               1                  0
Integer op                     Integer op                 1                  0

FP Loop: Where Are the Hazards?
First translate into MIPS code (to simplify, assume 8 is the lowest address):
for (i=1000; i>0; i=i-1)
  x[i] = x[i] + s;

Loop: L.D    F0,0(R1)     ;F0 <= x[i]
      ADD.D  F4,F0,F2     ;add scalar in F2
      S.D    0(R1),F4     ;store result: x[i] <= F4
      DADDUI R1,R1,#-8    ;decrement pointer by 8 bytes (DW)
      BNEZ   R1,Loop      ;branch if R1 != 0

FP Loop Showing Stalls

Instruction producing result   Instruction using result   Stalls (clock cycles)
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1

1 Loop: L.D    F0,0(R1)     ;F0 = vector element
2       stall
3       ADD.D  F4,F0,F2     ;add scalar in F2
4       stall
5       stall
6       S.D    0(R1),F4     ;store result
7       DADDUI R1,R1,#-8    ;decrement pointer by 8 bytes (DW)
8       stall               ;assumes can't forward to branch
9       BNEZ   R1,Loop      ;branch if R1 != 0

Overall: 9 clock cycles per iteration.

Revised FP Loop Minimizing Stalls
Swap DADDUI and S.D by changing the offset of S.D:

1 Loop: L.D    F0,0(R1)
2       DADDUI R1,R1,#-8
3       ADD.D  F4,F0,F2
4       stall
5       stall
6       S.D    8(R1),F4     ;offset altered because DADDUI was moved up
7       BNEZ   R1,Loop

7 clock cycles, but just 3 for execution (L.D, ADD.D, S.D) and 4 for loop overhead. Can we make it faster?

Unroll Loop Four Times

 1 Loop: L.D    F0,0(R1)
 3       ADD.D  F4,F0,F2
 6       S.D    0(R1),F4       ;drop DADDUI & BNEZ
 7       L.D    F6,-8(R1)
 9       ADD.D  F8,F6,F2
12       S.D    -8(R1),F8      ;drop DADDUI & BNEZ
13       L.D    F10,-16(R1)
15       ADD.D  F12,F10,F2
18       S.D    -16(R1),F12    ;drop DADDUI & BNEZ
19       L.D    F14,-24(R1)
21       ADD.D  F16,F14,F2
24       S.D    -24(R1),F16
25       DADDUI R1,R1,#-32     ;alter to 4*8
26       BNEZ   R1,LOOP

The gaps in the cycle numbers are the 1-cycle load-use stall before each ADD.D and the 2-cycle stall before each S.D.
27 clock cycles, or 6.75 per iteration. Can we rewrite the loop to minimize stalls?

Unrolled Loop That Minimizes Stalls

 1 Loop: L.D    F0,0(R1)
 2       L.D    F6,-8(R1)
 3       L.D    F10,-16(R1)
 4       L.D    F14,-24(R1)
 5       ADD.D  F4,F0,F2
 6       ADD.D  F8,F6,F2
 7       ADD.D  F12,F10,F2
 8       ADD.D  F16,F14,F2
 9       S.D    0(R1),F4
10       S.D    -8(R1),F8
11       S.D    -16(R1),F12
12       DADDUI R1,R1,#-32
13       S.D    8(R1),F16      ;8 - 32 = -24
14       BNEZ   R1,LOOP

14 clock cycles, or 3.5 per iteration.

Loop Unrolling Decisions
To unroll and schedule a loop, the compiler must:
Understand the dependences between instructions and how the instructions can be reordered.
Determine that unrolling is useful by finding that the loop iterations are independent (except for the loop maintenance code).
Use different registers to avoid unnecessary constraints forced by using the same registers for different computations.
Eliminate the extra test and branch instructions and adjust the loop termination and iteration code.
Determine that the loads and stores in the unrolled loop are independent by observing that loads and stores from different iterations are independent; this requires analyzing memory addresses and finding that they do not refer to the same address.
Schedule the code, preserving any dependences needed to yield the same result as the original code.
(A source-level sketch of these decisions follows below.)
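A hypothetical source-level sketch of these decisions, applied to the x[i] = x[i] + s loop from the earlier slides (unrolled by 4; n is assumed to be a multiple of 4):

/* Unrolled by 4: one decrement and one branch per four elements, and the
   four element computations use distinct temporaries ("different
   registers"), so they are independent and can be freely scheduled. */
void add_scalar_unrolled(double x[], double s, int n) {   /* assumes n % 4 == 0 */
    for (int i = n; i > 0; i -= 4) {
        double t0 = x[i - 1] + s;
        double t1 = x[i - 2] + s;
        double t2 = x[i - 3] + s;
        double t3 = x[i - 4] + s;
        x[i - 1] = t0;
        x[i - 2] = t1;
        x[i - 3] = t2;
        x[i - 4] = t3;
    }
}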

Three Limits to Loop Unrolling
1. Diminishing returns: the amount of loop overhead amortized with each extra unrolling decreases (Amdahl's Law).
2. Growth in code size: for larger loops, the concern is that it increases the instruction cache miss rate.
3. Register pressure: a potential shortfall in registers created by aggressive unrolling and scheduling. If it is not possible to allocate all live values to registers, unrolling may lose some or all of its advantage.
Loop unrolling reduces the impact of branches on the pipeline; another way is branch prediction.

Software Pipelining
Software pipelining interleaves instructions from different iterations without loop unrolling.
Loop: X[i] = X[i] + a;   % 0 <= i < N

Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SUBI  R1, R1, #8
      SD    8(R1), F4
      BNEZ  R1, Loop

Cycles: assume LD takes 2 cycles and ADDD takes 3 cycles; 7 cycles per iteration in total.

Software Pipeline Example (symbolic loop unrolling)

Prologue:
      LD    F0, 0(R1)     % load X[0]
      ADDD  F4, F0, F2    % add X[0]
      SUBI  R1, R1, #8
      LD    F0, 0(R1)     % load X[1]

Body (no dependences within an iteration):
Loop: SD    8(R1), F4     % store X[i]
      ADDD  F4, F0, F2    % add X[i-1]
      LD    F0, 0(R1)     % load X[i-2]
      SUBI  R1, R1, #8
      BNEZ  R1, Loop

Epilogue:
      SD    0(R1), F4     % store X[N-2]
      ADDD  F4, F0, F2    % add X[N-1]
      SD    -8(R1), F4    % store X[N-1]

Software Pipelining (2)
Needs startup code before the loop and finish code after it:
Prologue: LD for iterations 1 and 2, ADDD for iteration 1.
Epilogue: ADDD for the last iteration and SD for the last 2 iterations.
Software pipelining is a software equivalent of Tomasulo's algorithm: it interleaves instructions from different iterations without loop unrolling.
Pros and cons:
Makes register allocation and management difficult.
Advantage over loop unrolling: less code space used.
Software pipelining reduces loop idle time, while loop unrolling reduces loop overhead; the best results come from combining both.
Limitation: loop-carried dependences.