CS203 – Advanced Computer Architecture

Slides:



Advertisements
Similar presentations
CPE 731 Advanced Computer Architecture Instruction Level Parallelism Part I Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.
Advertisements

Compiler techniques for exposing ILP
ENGS 116 Lecture 101 ILP: Software Approaches Vincent H. Berk October 12 th Reading for today: , 4.1 Reading for Friday: 4.2 – 4.6 Homework #2:
CS 6461: Computer Architecture Basic Compiler Techniques for Exposing ILP Instructor: Morris Lancaster Corresponding to Hennessey and Patterson Fifth Edition.
Computer Architecture Lecture 7 Compiler Considerations and Optimizations.
Computer Organization and Architecture (AT70.01) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: Based.
CPE 631: ILP, Static Exploitation Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar Milenkovic,
1 4/20/06 Exploiting Instruction-Level Parallelism with Software Approaches Original by Prof. David A. Patterson.
FTC.W99 1 Advanced Pipelining and Instruction Level Parallelism (ILP) ILP: Overlap execution of unrelated instructions gcc 17% control transfer –5 instructions.
Instruction Level Parallelism María Jesús Garzarán University of Illinois at Urbana-Champaign.
COMP4611 Tutorial 6 Instruction Level Parallelism
Eliminating Stalls Using Compiler Support. Instruction Level Parallelism gcc 17% control transfer –5 instructions + 1 branch –Reordering among 5 instructions.
ILP: Loop UnrollingCSCE430/830 Instruction-level parallelism: Loop Unrolling CSCE430/830 Computer Architecture Lecturer: Prof. Hong Jiang Courtesy of Yifeng.
EECC551 - Shaaban #1 Fall 2003 lec# Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining increases performance by overlapping.
1 COMP 740: Computer Architecture and Implementation Montek Singh Tue, Feb 24, 2009 Topic: Instruction-Level Parallelism IV (Software Approaches/Compiler.
EEL Advanced Pipelining and Instruction Level Parallelism Lotzi Bölöni.
Computer Architecture Instruction Level Parallelism Dr. Esam Al-Qaralleh.
Dynamic Branch PredictionCS510 Computer ArchitecturesLecture Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining.
CS152 Lec15.1 Advanced Topics in Pipelining Loop Unrolling Super scalar and VLIW Dynamic scheduling.
Lecture 3: Chapter 2 Instruction Level Parallelism Dr. Eng. Amr T. Abdel-Hamid CSEN 601 Spring 2011 Computer Architecture Text book slides: Computer Architec.
1 Lecture 10: Static ILP Basics Topics: loop unrolling, static branch prediction, VLIW (Sections 4.1 – 4.4)
1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 3 (and Appendix C) Instruction-Level Parallelism and Its Exploitation Computer Architecture.
Chapter 3 Instruction-Level Parallelism and Its Dynamic Exploitation – Concepts 吳俊興 高雄大學資訊工程學系 October 2004 EEF011 Computer Architecture 計算機結構.
EECC551 - Shaaban #1 Winter 2002 lec# Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining increases performance by overlapping.
EECC551 - Shaaban #1 Spring 2004 lec# Static Compiler Optimization Techniques We already examined the following static compiler techniques aimed.
EECC551 - Shaaban #1 Spring 2006 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.
EECC551 - Shaaban #1 Fall 2005 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.
EECC551 - Shaaban #1 Winter 2003 lec# Static Compiler Optimization Techniques We already examined the following static compiler techniques aimed.
1 Lecture 5: Pipeline Wrap-up, Static ILP Basics Topics: loop unrolling, VLIW (Sections 2.1 – 2.2) Assignment 1 due at the start of class on Thursday.
Chapter 2 Instruction-Level Parallelism and Its Exploitation
EECC551 - Shaaban #1 Fall 2002 lec# Floating Point/Multicycle Pipelining in MIPS Completion of MIPS EX stage floating point arithmetic operations.
EECC551 - Shaaban #1 Winter 2011 lec# Pipelining and Instruction-Level Parallelism (ILP). Definition of basic instruction block Increasing Instruction-Level.
EECC551 - Shaaban #1 Spring 2004 lec# Definition of basic instruction blocks Increasing Instruction-Level Parallelism & Size of Basic Blocks.
COMP381 by M. Hamdi 1 Loop Level Parallelism Instruction Level Parallelism: Loop Level Parallelism.
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
CPSC614 Lec 6.1 Exploiting Instruction-Level Parallelism with Software Approach #1 E. J. Kim.
EECC551 - Shaaban #1 Fall 2001 lec# Floating Point/Multicycle Pipelining in DLX Completion of DLX EX stage floating point arithmetic operations.
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 1 Compiler Techniques for ILP  So far we have explored dynamic hardware techniques for.
CS203 – Advanced Computer Architecture Instruction Level Parallelism.
Compiler Techniques for ILP
Concepts and Challenges
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue
CS5100 Advanced Computer Architecture Instruction-Level Parallelism
Instruction-Level Parallelism (ILP)
CSCE430/830 Computer Architecture
CSL718 : VLIW - Software Driven ILP
CPE 631 Lecture 13: Exploiting ILP with SW Approaches
Lecture: Static ILP Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2)
Compiler techniques for exposing ILP (cont)
Siddhartha Chatterjee Spring 2008
Lecture: Static ILP Topics: loop unrolling, software pipelines (Sections C.5, 3.2) HW3 posted, due in a week.
Advanced Computer Architecture
CS 704 Advanced Computer Architecture
CS 704 Advanced Computer Architecture
Adapted from the slides of Prof
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
Instruction Level Parallelism (ILP)
CC423: Advanced Computer Architecture ILP: Part V – Multiple Issue
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
Dynamic Hardware Prediction
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
How to improve (decrease) CPI
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
CPE 631 Lecture 14: Exploiting ILP with SW Approaches (2)
CMSC 611: Advanced Computer Architecture
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
Loop-Level Parallelism
Lecture 5: Pipeline Wrap-up, Static ILP
Presentation transcript:

CS203 – Advanced Computer Architecture Instruction Level Parallelism

Instruction Level Parallelism Instruction-Level Parallelism (ILP): overlap the execution of instructions to improve performance 2 approaches to exploit ILP: Dynamic and Static Dynamic Rely on hardware to help discover and exploit the parallelism dynamically (e.g., Pentium 4, AMD Opteron, IBM Power) Out-of-Order execution, superscalar architectures Static: Rely on software technology to find parallelism, statically at compile-time (e.g., Itanium 2)

Instruction-Level Parallelism (ILP) Basic Block (BB) ILP is quite small BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit average dynamic branch frequency 15% to 25% => 4 to 7 instructions execute between a pair of branches Plus instructions in BB likely to depend on each other To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks Implies predicting branches! Simplest form of ILP loop-level parallelism to exploit parallelism among iterations of a loop. E.g., for (i=1; i<=1000; i=i+1)         x[i] = x[i] + y[i];

Loop-Level Parallelism Exploit loop-level parallelism to parallelism by “unrolling loop” either by dynamic via branch prediction and/or dataflow µarchitectures static via loop unrolling by compiler another way is vectors, to be covered later Determining instruction dependence is critical to Loop Level Parallelism If 2 instructions are parallel, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls (assuming no structural hazards) dependent, they are not parallel and must be executed in order, although they may often be partially overlapped

ILP and Data Dependencies,Hazards HW/SW must preserve program order: Order instructions would execute as if executed sequentially as determined by original source program Dependences are a property of programs Opportunities for dynamic ILP constrained by Dynamic dependence resolution Resources available Ability to predict branches Data dependence, as seen before RAW, WAR, WAW

Structure of Compilers

Compiler Optimizations Machine dependent strength reduction pipeline scheduling load delay slot filling Impact of optimizations Dramatic impact on fp code Most reduction in integer and load/store operations. Some reduction in branches Goals: (1) correctness, (2) speed. Optimizations High level: at source code level procedure integration Local: within a basic block common sub-expression elimination (cse) constant propagation stack height reduction Global: across basic blocks global cse copy propagation invariant code motion induction variable elimination

Effects of Compiler Optimizations

Compilers and ISA How ISA design can help compilers Regularity: keep data types and addressing modes orthogonal whenever reasonable. Provide primitives not solutions: do not attempt to match high-level language constructs in the ISA. Simplify trade-off among alternatives: make it easier to select best code sequence. Allow the binding of compile time known values.

Example: Expression Execution Sum = a + b + c + d; The semantic is Sum = (((a + b) + c) + d); Parallel (associative) execution Sum = (a + b) + (c + d); Add fs, fa,fb 0 3 Add fs, fs, fc 3 6 Add fs, fs, fd 6 9 cycles Add f1, fa,fb 0 3 Add f2, fc, fd 1 4 Add fs, f1, f2 4 7

Compiler Support for ILP Dependency Analysis Loop Unrolling Software Pipelining

ILP in Loops Example 1 Example 2 for (j=1; j<=N; j++) { for (j=1; j<=N; j++) X[j] = X[j] + a; Dependence within same iteration on X[j]. Loop-carried dependence on j, but j is an induction variable. Example 2 for (j=1; j<=N; j++) { A[j+1] = A[j] + C[j]; /* S1 */ B[j+1] = B[j] + A[j+1]; /* S2 */ Loop carried dependence on S1. Data flow dependence from S1 to S2. Induction variable = variable that gets incremented or decremented by a fixed amount on every iteration of a loop

ILP in Loops (2) Example 3 for (j=1; j<=N; j++) { A[j] = A[j] + B[j]; /* S1 */ B[j+1] = C[j] + D[j]; /* S2 */ Loop carried dependence from S2 to S1; but no circular dependencies. Parallel version A[1] = A[1] + B[1]; for (j=1; j<=N-1; j++) { B[j+1] = C[j] + D[j]; A[j+1] = A[j+1] + B[j+1]; } B[N+1] = C[N+1] + D[N+1];

Dependence Analysis Dependency detection algorithms: GCD test Determine whether two references to an array element (one write and one read) are the same. assume all arrays indices to be affine X(a.i + b) and X(c.i +d) dependency can exist iff two affine functions have the same value for different indices within the loop bounds: let l <= j, k <= u if (a.j + b = c.k + d) then there is a dependency. But a, b, c, d may not be known at compile time. GCD test sufficient condition for the existence of dependency: if dependency exists then GCD test is true. GCD(c,a) must divide (d-b).

Dependence Analysis (2) GCD Example: X(a.i + b) and X(c.i +d) for(int i=0;i<100;i++) { A[6*i]=B[i]; /*s1*/ C[i]=A[4*i+1]; /*s2*/ } GCD test: a=6, b=0, c=4, d=1, GCD(c,a) = 2 & (d-b) = 1 There are no dependencies between the accesses to array A.

Dependence Analysis (3) In the general case Determining whether a dependency exists is NP-complete. Exact tests exist for limited cases. A hierarchy of tests in increasing generality and cost is used. Drawback of dependency analysis: Applies only to references within single loop nests with affine index functions. Where it fails: pointer references opposed to index; indirect indexing (e.g. A[B[I]]) used in sparse matrix computations; when dependency can potentially exist statically but never does in practice (at run time); when optimization depends on knowing which write of a variable a read depends on.

Software Techniques - Example This code, add a scalar to a vector: for (i=1000; i>0; i=i–1) x[i] = x[i] + s; Assume following latencies for all examples Ignore delayed branch in these examples Instruction Instruction Latency stalls between producing result using result in cycles in cycles FP ALU op Another FP ALU op 4 3 FP ALU op Store double 3 2 Load double FP ALU op 1 1 Load double Store double 1 0 Integer op Integer op 1 0

FP Loop: Where are the Hazards? First translate into MIPS code: To simplify, assume 8 is lowest address for (i=1000; i>0; i=i–1) x[i] = x[i] + s; Loop: L.D F0,0(R1) ; F0 <= x[I] ADD.D F4,F0,F2; add scalar from F2 S.D 0(R1),F4 ; store result; x[I] <= F4 DADDUI R1,R1,-8; decrement pointer 8B (DW) BNEZ R1,Loop; branch R1 != 0

FP Loop Showing Stalls overall 9 clock cycles 1 Loop: L.D F0,0(R1) ;F0=vector element 2 stall 3 ADD.D F4,F0,F2 ;add scalar in F2 4 stall 5 stall 6 S.D 0(R1),F4 ;store result 7 DADDUI R1,R1,-8 ;decrement pointer 8B (DW) 8 stall ;assumes can’t forward to branch 9 BNEZ R1,Loop ;branch R1!=zero Instruction Instruction stalls between producing result using result clock cycles FP ALU op Another FP ALU op 3 FP ALU op Store double 2 Load double FP ALU op 1 overall 9 clock cycles

Revised FP Loop Minimizing Stalls 1 Loop: L.D F0,0(R1) 2 DADDUI R1,R1,-8 3 ADD.D F4,F0,F2 4 stall 5 stall 6 S.D 8(R1),F4 ;altered offset when move DSUBUI 7 BNEZ R1,Loop Swap DADDUI and S.D by changing address of S.D 7 clock cycles, but just 3 for execution (L.D, ADD.D,S.D), 4 for loop overhead. Can we make it faster? 3 clocks doing work, 3 overhead (stall, branch, sub)

Unroll Loop Four Times Rewrite loop to minimize stalls? 1 cycle stall 1 Loop: L.D F0,0(R1) 3 ADD.D F4,F0,F2 6 S.D 0(R1),F4 ;drop DSUBUI & BNEZ 7 L.D F6,-8(R1) 9 ADD.D F8,F6,F2 12 S.D -8(R1),F8 ;drop DSUBUI & BNEZ 13 L.D F10,-16(R1) 15 ADD.D F12,F10,F2 18 S.D -16(R1),F12 ;drop DSUBUI & BNEZ 19 L.D F14,-24(R1) 21 ADD.D F16,F14,F2 24 S.D -24(R1),F16 25 DADDUI R1,R1,#-32 ;alter to 4*8 26 BNEZ R1,LOOP 27 clock cycles, or 6.75 per iteration Rewrite loop to minimize stalls? 2 cycles stall

Unrolled Loop That Minimizes Stalls 1 Loop: L.D F0,0(R1) 2 L.D F6,-8(R1) 3 L.D F10,-16(R1) 4 L.D F14,-24(R1) 5 ADD.D F4,F0,F2 6 ADD.D F8,F6,F2 7 ADD.D F12,F10,F2 8 ADD.D F16,F14,F2 9 S.D 0(R1),F4 10 S.D -8(R1),F8 11 S.D -16(R1),F12 12 DSUBUI R1,R1,#32 13 S.D 8(R1),F16 ; 8-32 = -24 14 BNEZ R1,LOOP 14 clock cycles, or 3.5 per iteration

Loop Unrolling Decisions Understanding dependencies between instructions and how the instructions can be reordered Determine loop unrolling usefulness by finding that loop iterations were independent (except for maintenance code) Use different registers to avoid unnecessary constraints forced by using same registers for different computations Eliminate the extra test and branch instructions and adjust the loop termination and iteration code Determine that loads and stores in unrolled loop are independent by observing that loads and stores from different iterations are independent: requires analyzing memory addresses and finding that they do not refer to the same address Schedule the code, preserving any dependences needed to yield the same result as the original code

3 Limits to Loop Unrolling Decrease in amount of overhead amortized with each extra unrolling Amdahl’s Law Growth in code size For larger loops, concern it increases the instruction cache miss rate Register pressure: potential shortfall in registers created by aggressive unrolling and scheduling If not be possible to allocate all live values to registers, may lose some or all of its advantage Loop unrolling reduces impact of branches on pipeline; another way is branch prediction

Software Pipelining Software Pipelining: Cycles Loop: LD F0, 0(R1) Loop: X[i] = X[i] + a; % 0 <= i < N Loop: LD F0, 0(R1) ADDD F4, F0, F2 SUBI R1, R1, #8 SD 8(R1), F4 BNEZ R1, Loop Cycles Assume LD is 2 cycles and ADDD is 3 cycles Total 7 cycles per iteration Interleaves instructions from different iterations without loop unrolling.

Software Pipeline Example LD F0, 0(R1) % load X[0] ADDD F4, F0, F2 % add X[0] SUBI R1, R1, #8 LD F0, 0(R1) % load X[1] Loop SD 8(R1), F4 % store X[I] ADDD F4, F0, F2 % add X[I-1] LD F0, 0(R1) % load X[I-2] SUBI R1, R1, #8 BNEZ R1, Loop SD 0(R1), F4 % store X[N-2] ADDD F4, F0, F2 % add X[N-1] SD -8(R1), F4 % store X[N-1] Symbolic loop unrolling prologue Body: no dependencies epilogue

Software Pipelining (2) Needs startup code before loop and finish code after: Prologue: LD for iterations 1 and 2, ADDD for iteration 1 Epilogue: ADDD for last iteration and SD for last 2 iterations. Software equivalent of Tomasulo. Interleaves instructions from different iterations without loop unrolling. Pros and Cons Makes register allocation and management difficult. Advantage over loop unrolling: less code space used. SP reduces loop idle time, loop unrolling reduces loop overhead. Best use combination of both. Limitations: loop carried dependencies