1 COMP 206: Computer Architecture and Implementation Montek Singh Wed., Oct. 29, 2003 Topic: Software Approaches for ILP (Compiler Techniques) contd.

2 Outline
 Motivation
 Compiler scheduling
   Loop unrolling
   Software pipelining
   Static branch prediction
   VLIW
 Reading: HP3, Sections

3 Review: Instruction-Level Parallelism (ILP)
 Pipelining is most effective when there is parallelism among instructions
   instructions u and v are parallel if neither depends on the other, i.e., neither P(u,v) nor P(v,u) holds
 Problem: parallelism within a basic block is limited
   a branch frequency of 15% implies only about 6 instructions per basic block
   these instructions are likely to depend on each other
   need to look beyond basic blocks
 Solution: exploit loop-level parallelism, i.e., parallelism across loop iterations
   to convert loop-level parallelism into ILP, the loop must be "unrolled"
     dynamically, by the hardware
     statically, by the compiler
     using vector instructions: the same operation is applied to all the vector elements

4 Motivating Example for Loop Unrolling

for (i = 1000; i > 0; i--)
    x[i] = x[i] + s;

Assumptions:
 Scalar s is in register F2
 Array x starts at memory address 0
 1-cycle branch delay
 No structural hazards

LOOP:  L.D     F0, 0(R1)
       ADD.D   F4, F0, F2
       S.D     0(R1), F4
       DADDUI  R1, R1, -8
       BNEZ    R1, LOOP
       NOP

10 cycles per iteration
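The assembly counts down through the array in R1 and exits when R1 reaches 0, which is why the slide assumes x starts at address 0. A source-level sketch of that formulation (the function and parameter names are illustrative, not from the slides):

    /* Pointer view of the loop above: p walks backward through x one
       double (8 bytes) at a time, matching DADDUI R1, R1, -8; the loop
       exits when p reaches the start of the array, matching BNEZ under
       the assumption that x begins at address 0. The elements touched
       are x[1]..x[n], as in the original 1-based loop. */
    void add_scalar(double *x, double s, long n)   /* n = 1000 in the example */
    {
        for (double *p = &x[n]; p != &x[0]; p--)
            *p = *p + s;
    }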

5 How Far Can We Get With Scheduling?

Original:
LOOP:  L.D     F0, 0(R1)
       ADD.D   F4, F0, F2
       S.D     0(R1), F4
       DADDUI  R1, R1, -8
       BNEZ    R1, LOOP
       NOP

Scheduled:
LOOP:  L.D     F0, 0(R1)
       DADDUI  R1, R1, -8
       ADD.D   F4, F0, F2
       nop
       BNEZ    R1, LOOP
       S.D     8(R1), F4

6 cycles per iteration
Note the change in the S.D instruction, from 0(R1) to 8(R1); this is a non-trivial change!

6 Observations on Scheduled Code
 3 out of 5 instructions involve FP work
 The other two constitute loop overhead
 Could we improve performance by unrolling the loop?
   assume the number of loop iterations is a multiple of 4, and unroll the loop body four times
   in real life, we must also handle loop counts that are not multiples of 4 (see the source-level sketch below)
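A minimal source-level sketch of 4-way unrolling with a cleanup loop for counts that are not a multiple of 4 (the function and parameter names are illustrative, not from the slides):

    /* Unroll the countdown loop by 4 at the source level. The cleanup
       loop handles the leftover 0-3 iterations when n is not a
       multiple of 4. */
    void add_scalar_unrolled(double *x, double s, long n)
    {
        long i = n;
        for (; i >= 4; i -= 4) {          /* main unrolled body */
            x[i]     = x[i]     + s;
            x[i - 1] = x[i - 1] + s;
            x[i - 2] = x[i - 2] + s;
            x[i - 3] = x[i - 3] + s;
        }
        for (; i > 0; i--)                /* cleanup for the remainder */
            x[i] = x[i] + s;
    }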

7 Unrolling: Take 1
 Even though we have gotten rid of the control dependences, we still have data dependences through R1
 We could remove these data dependences by observing that R1 is decremented by 8 each time:
   adjust the address specifiers
   delete the first three DADDUIs
   change the constant in the fourth DADDUI to -32
 These are non-trivial inferences for a compiler to make

LOOP:  L.D     F0, 0(R1)
       ADD.D   F4, F0, F2
       S.D     0(R1), F4
       DADDUI  R1, R1, -8
       L.D     F0, 0(R1)
       ADD.D   F4, F0, F2
       S.D     0(R1), F4
       DADDUI  R1, R1, -8
       L.D     F0, 0(R1)
       ADD.D   F4, F0, F2
       S.D     0(R1), F4
       DADDUI  R1, R1, -8
       L.D     F0, 0(R1)
       ADD.D   F4, F0, F2
       S.D     0(R1), F4
       DADDUI  R1, R1, -8
       BNEZ    R1, LOOP
       NOP

8 Unrolling: Take 2
 Performance is now limited by the WAR (anti) dependences on F0
 These are name dependences:
   the instructions are not in a producer-consumer relation
   they are simply using the same registers, but they don't have to
   we can use different registers in different loop iterations, subject to availability
 Let's rename registers

LOOP:  L.D     F0, 0(R1)
       ADD.D   F4, F0, F2
       S.D     0(R1), F4
       L.D     F0, -8(R1)
       ADD.D   F4, F0, F2
       S.D     -8(R1), F4
       L.D     F0, -16(R1)
       ADD.D   F4, F0, F2
       S.D     -16(R1), F4
       L.D     F0, -24(R1)
       ADD.D   F4, F0, F2
       S.D     -24(R1), F4
       DADDUI  R1, R1, -32
       BNEZ    R1, LOOP
       NOP

9 Unrolling: Take 3

LOOP:  L.D     F0, 0(R1)
       ADD.D   F4, F0, F2
       S.D     0(R1), F4
       L.D     F6, -8(R1)
       ADD.D   F8, F6, F2
       S.D     -8(R1), F8
       L.D     F10, -16(R1)
       ADD.D   F12, F10, F2
       S.D     -16(R1), F12
       L.D     F14, -24(R1)
       ADD.D   F16, F14, F2
       S.D     -24(R1), F16
       DADDUI  R1, R1, -32
       BNEZ    R1, LOOP
       NOP

 Time for execution of 4 iterations:
   14 instruction cycles
   4 L.D → ADD.D stalls
   8 ADD.D → S.D stalls
   1 DADDUI → BNEZ stall
   1 branch delay stall (NOP)
 28 cycles for 4 iterations, or 7 cycles per iteration
 Slower than the scheduled version of the original loop, which needed 6 cycles per iteration
 Let's schedule the unrolled loop

10 Unrolling: Take 4
 This code runs without stalls:
   14 cycles for 4 iterations
   3.5 cycles per iteration
   loop control overhead is paid only once every four iterations
 Note that the original loop had three FP instructions that were not independent
 Loop unrolling exposed independent instructions from multiple loop iterations
 By unrolling further, we can approach the asymptotic rate of 3 cycles per iteration, subject to the availability of registers

LOOP:  L.D     F0, 0(R1)
       L.D     F6, -8(R1)
       L.D     F10, -16(R1)
       L.D     F14, -24(R1)
       ADD.D   F4, F0, F2
       ADD.D   F8, F6, F2
       ADD.D   F12, F10, F2
       ADD.D   F16, F14, F2
       S.D     0(R1), F4
       S.D     -8(R1), F8
       DADDUI  R1, R1, -32
       S.D     16(R1), F12
       BNEZ    R1, LOOP
       S.D     8(R1), F16

11 What Did The Compiler Have To Do?
 Determine that it was legal to move the S.D after the DADDUI and BNEZ, and find the amount by which to adjust the S.D offset
 Determine that loop unrolling would be useful by discovering that the loop iterations are independent
 Rename registers to avoid name dependences
 Eliminate the extra tests and branches and adjust the loop control
 Determine that the L.D's and S.D's can be interchanged, by determining that (since R1 is not updated between them) the address specifiers 0(R1), -8(R1), -16(R1), -24(R1) all refer to different memory locations
 Schedule the code, preserving dependences

12 Limits to Gain from Loop Unrolling
 Benefit of the reduction in loop overhead tapers off
   the amount of overhead amortized diminishes with successive unrolls
 Code size limitations
   for larger loops, code size growth is a concern, especially for embedded processors with limited memory
   instruction cache miss rate increases
 Architectural/compiler limitations
   register pressure: many registers are needed to exploit ILP, which is especially challenging in multiple-issue architectures

13 Dependences in the Loop Context
 Three kinds of dependences:
   data dependence
   name dependence
   control dependence
 In the context of loop-level parallelism, a data dependence can be
   loop-independent
   loop-carried
 Data dependences act as a limit on how much ILP can be exploited in a compiled program
   the compiler tries to identify and eliminate dependences
   the hardware tries to prevent dependences from becoming stalls
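A minimal sketch contrasting loop-independent and loop-carried data dependences (arrays and bounds are illustrative, not from the slides):

    /* Loop-independent vs. loop-carried data dependences. */
    void dependence_examples(double *a, double *b, double *t, int n)
    {
        /* Loop-independent dependence: both statements refer to the SAME
           iteration i, so different iterations remain independent. */
        for (int i = 0; i < n; i++) {
            t[i] = a[i] * 2.0;       /* produces t[i]          */
            b[i] = t[i] + 1.0;       /* consumes t[i] (same i) */
        }

        /* Loop-carried dependence: iteration i consumes the value produced
           by iteration i-1, so iterations cannot simply run in parallel. */
        for (int i = 1; i < n; i++)
            a[i] = a[i - 1] + 1.0;
    }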

14 Data and Name Dependences
 Instruction v is data-dependent on instruction u if
   u produces a result that v consumes
 Instruction v is anti-dependent on instruction u if
   u precedes v, and
   v writes a register or memory location that u reads
 Instruction v is output-dependent on instruction u if
   u precedes v, and
   v writes a register or memory location that u writes
 Relationship to hazards:
   a data dependence that cannot be removed by renaming corresponds to a RAW hazard
   an anti-dependence corresponds to a WAR hazard
   an output dependence corresponds to a WAW hazard
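A small sketch of the three dependence kinds acting on a single variable, with variable names that are purely illustrative:

    /* All three dependence kinds on one variable r. */
    void dependence_kinds(double x, double y, double a, double b, double out[2])
    {
        double r;
        r      = x + y;     /* u:  writes r                                 */
        out[0] = r * 2.0;   /* v1: reads r  -> data dependence (RAW) on u   */
        r      = a + b;     /* v2: writes r -> anti-dependence (WAR) on v1, */
                            /*     output dependence (WAW) on u             */
        out[1] = r * 2.0;
        /* Renaming: giving the second value its own variable (say r2)
           removes the WAR and WAW cases; only the true RAW dependences
           remain -- exactly what "Unrolling: Take 3" did with F6..F16. */
    }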

15 Control Dependences
 A control dependence determines the ordering of an instruction with respect to a branch instruction, so that the non-branch instruction is executed only when it should be
   if (p1) {s1;}
   if (p2) {s2;}
 Control dependence constrains code motion:
   an instruction that is control-dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch
   an instruction that is not control-dependent on a branch cannot be moved after the branch, so that its execution is controlled by the branch
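A tiny sketch (names illustrative, not from the slides) of why the first constraint matters: the guarded statement below cannot be hoisted above its branch.

    #include <stddef.h>

    /* The dereference is control-dependent on the (p != NULL) branch.
       Moving it above the test would execute it when it should not,
       e.g. when p is NULL. */
    double guarded_read(const double *p)
    {
        double v = 0.0;
        if (p != NULL)      /* branch */
            v = *p;         /* must stay under the branch's control */
        return v;
    }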

16 Data Dependence in Loop Iterations

Loop body 1:
    A[u+1] = A[u] + C[u];
    B[u+1] = B[u] + A[u+1];
Two consecutive iterations:
    A[u+1] = A[u] + C[u];        B[u+1] = B[u] + A[u+1];
    A[u+2] = A[u+1] + C[u+1];    B[u+2] = B[u+1] + A[u+2];

Loop body 2:
    A[u]   = A[u] + B[u];
    B[u+1] = C[u] + D[u];
Two consecutive iterations:
    A[u]   = A[u] + B[u];        B[u+1] = C[u] + D[u];
    A[u+1] = A[u+1] + B[u+1];    B[u+2] = C[u+1] + D[u+1];

Loop body 2 with the statements re-grouped across iterations:
    B[u+1] = C[u] + D[u];        A[u+1] = A[u+1] + B[u+1];
    B[u+2] = C[u+1] + D[u+1];    A[u+2] = A[u+2] + B[u+2];
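Written out as complete loops (the slide shows only the bodies with index u; the bounds and array sizes below are assumptions for illustration):

    void loop_examples(double *A, double *B, double *C, double *D, int n)
    {
        /* Loop 1: A[u+1] uses A[u] computed by the previous iteration
           (and B[u+1] uses B[u]); this loop-carried dependence forces the
           iterations to execute in order. */
        for (int u = 1; u <= n; u++) {
            A[u + 1] = A[u] + C[u];
            B[u + 1] = B[u] + A[u + 1];
        }

        /* Loop 2: B[u+1] produced here is consumed by the A-statement of
           the NEXT iteration. The dependence is loop-carried, but no
           statement depends on itself across iterations, so the loop can
           be transformed to be parallel (next slide). */
        for (int u = 1; u <= n; u++) {
            A[u]     = A[u] + B[u];
            B[u + 1] = C[u] + D[u];
        }
    }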

17 Loop Transformation
 Sometimes a loop-carried dependence does not prevent loop parallelization
   example: the second loop of the previous slide (a transformed, parallelizable version is sketched below)
 In other cases, a loop-carried dependence prohibits loop parallelization
   example: the first loop of the previous slide

Second loop body:
    A[u]   = A[u] + B[u];
    B[u+1] = C[u] + D[u];

Successive iterations:
    A[u]   = A[u] + B[u];        B[u+1] = C[u] + D[u];
    A[u+1] = A[u+1] + B[u+1];    B[u+2] = C[u+1] + D[u+1];
    A[u+2] = A[u+2] + B[u+2];    B[u+3] = C[u+2] + D[u+2];
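A sketch of the transformation that makes the second loop parallel (this follows the standard textbook treatment; the bounds 1..n are assumptions for illustration): peel one statement off each end and interchange the two statements so that each remaining iteration is self-contained.

    void loop2_transformed(double *A, double *B, double *C, double *D, int n)
    {
        A[1] = A[1] + B[1];                  /* peeled first statement */
        for (int u = 1; u <= n - 1; u++) {
            B[u + 1] = C[u] + D[u];
            A[u + 1] = A[u + 1] + B[u + 1];  /* uses B[u+1] from this same
                                                iteration, so no loop-carried
                                                dependence remains inside the
                                                loop: iterations are parallel */
        }
        B[n + 1] = C[n] + D[n];              /* peeled last statement */
    }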

18 Software Pipelining
 Observation: if iterations of a loop are independent, then we can get ILP by taking instructions from different iterations
 Software pipelining: reorganize the loop so that each new iteration is made from instructions chosen from different iterations of the original loop
(Figure: a software-pipelined iteration draws its instructions from original iterations i0 through i4.)

19 Software Pipelining Example

Before: unrolled 3 times (as in slide 4)
    1   L.D     F0, 0(R1)
    2   ADD.D   F4, F0, F2
    3   S.D     0(R1), F4
    4   L.D     F0, -8(R1)
    5   ADD.D   F4, F0, F2
    6   S.D     -8(R1), F4
    7   L.D     F0, -16(R1)
    8   ADD.D   F4, F0, F2
    9   S.D     -16(R1), F4
    10  DADDUI  R1, R1, -24
    11  BNEZ    R1, LOOP

After: software pipelined
    start-up:   L.D     F0, 0(R1)
                ADD.D   F4, F0, F2
                L.D     F0, -8(R1)
    1   S.D     0(R1), F4      ; stores M[i]
    2   ADD.D   F4, F0, F2     ; adds to M[i-1]
    3   L.D     F0, -16(R1)    ; loads M[i-2]
    4   DADDUI  R1, R1, -8
    5   BNEZ    R1, LOOP
    wind-down:  S.D     0(R1), F4
                ADD.D   F4, F0, F2
                S.D     -8(R1), F4

(Figure: IF ID EX Mem WB pipeline diagram for S.D, ADD.D, and L.D, showing that S.D reads F4 before ADD.D writes it, and ADD.D reads F0 before L.D writes it.)
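A source-level sketch of the same idea for the x[i] = x[i] + s loop, written in C purely for illustration (the real transformation is applied to the scheduled machine code): each kernel iteration stores the result of iteration i, adds for iteration i-1, and loads for iteration i-2.

    /* Software-pipelined form of: for (i = n; i > 0; i--) x[i] = x[i] + s;
       Assumes n >= 3; a real compiler would guard this with a check or a
       scalar fallback loop. */
    void add_scalar_swp(double *x, double s, long n)
    {
        double loaded = x[n];           /* start-up: load for iteration n   */
        double summed = loaded + s;     /* start-up: add for iteration n    */
        loaded = x[n - 1];              /* start-up: load for iteration n-1 */

        for (long i = n; i > 2; i--) {  /* kernel: one store, add, load     */
            x[i]   = summed;            /* store result of iteration i      */
            summed = loaded + s;        /* add for iteration i-1            */
            loaded = x[i - 2];          /* load for iteration i-2           */
        }

        x[2] = summed;                  /* wind-down: store iteration 2     */
        summed = loaded + s;            /* wind-down: add for iteration 1   */
        x[1] = summed;                  /* wind-down: store iteration 1     */
    }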

20 Software Pipelining: Concept

Original loop:
    Loop:  Li  Ei  Si  B Loop

Unrolled dynamic sequence:
    L1 E1 S1 B Loop   L2 E2 S2 B Loop   L3 E3 S3 B Loop   ...   Ln En Sn

 Notation: Load, Execute, Store
 Iterations are independent
 In the normal sequence, Ei depends on Li, and Si depends on Ei, leading to pipeline stalls
 Software pipelining attempts to reduce these delays by inserting other instructions between such dependent pairs, "hiding" the delay
   the "other" instructions are L and S instructions from other loop iterations
 Does this without consuming extra code space or registers
   performance is usually not as high as that of loop unrolling
 How can we permute L, E, S to achieve this?

"A Study of Scalar Compilation Techniques for Pipelined Supercomputers", S. Weiss and J. E. Smith, ISCA 1987

21 An Abstract View of Software Pipelining

Original loop:
    Loop:   Li  Ei  Si  B Loop

Pipelined schedules that maintain the original load/store order:
    L1
    Loop:   Ei  Si  Li+1  B Loop
    En  Sn

    J Entry
    Loop:   Si-1
    Entry:  Li  Ei  B Loop
    Sn

Pipelined schedules that change the original load/store order:
    L1
    Loop:   Ei  Li+1  Si  B Loop
    En  Sn

    L1
    J Entry
    Loop:   Si-1
    Entry:  Ei  Li+1  B Loop
    Sn-1  En  Sn

    L1
    J Entry
    Loop:   Li  Si-1
    Entry:  Ei  B Loop
    Sn

22 Other Compiler Techniques
 Static branch prediction, for example:
   predict always taken
   predict never taken
   predict: forward branches never taken, backward branches always taken
 In the code below, a stall is needed after the LD (the DSUBU uses R1)
   if the branch is almost always taken, and R7 is not needed on the fall-through path:
     move DADDU R7, R8, R9 to right after the LD
   if the branch is almost never taken, and R4 is not needed on the taken path:
     move the OR instruction to right after the LD

        LD      R1, 0(R2)
        DSUBU   R1, R1, R3
        BEQZ    R1, L
        OR      R4, R5, R6
        DADDU   R10, R4, R3
L:      DADDU   R7, R8, R9
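Static prediction can also be guided by source-level hints. A minimal sketch using the GCC/Clang builtin __builtin_expect (the function and the rare-error scenario are illustrative, not from the slides):

    #include <stddef.h>

    /* Hint that the error path is rarely taken, so the compiler can lay
       out and statically predict the branch accordingly and schedule the
       common path as straight-line code. */
    long sum_checked(const long *v, size_t n)
    {
        if (__builtin_expect(v == NULL, 0))   /* expected not taken */
            return -1;

        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += v[i];
        return s;
    }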

23 Very Long Instruction Word (VLIW)
 VLIW: the compiler schedules multiple instructions per issue
   the long instruction word has room for many operations
   by definition, all the operations the compiler puts in the long instruction word can execute in parallel
   e.g., 2 integer operations, 2 FP operations, 2 memory references, 1 branch
 16 to 24 bits per field => 7*16 = 112 bits to 7*24 = 168 bits wide
 Needs a very sophisticated compiling technique ...
   ... one that schedules across several branches
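A rough sketch of such an instruction word as a C struct with one field per operation slot (a purely illustrative encoding; the field names and the 24-bit width are assumptions, not a real VLIW format):

    /* One VLIW bundle: 7 operation slots of 24 bits each = 168 bits,
       matching the 2 int + 2 FP + 2 memory + 1 branch example above. */
    struct vliw_bundle {
        unsigned mem_ref1 : 24;   /* memory reference 1 */
        unsigned mem_ref2 : 24;   /* memory reference 2 */
        unsigned fp_op1   : 24;   /* FP operation 1     */
        unsigned fp_op2   : 24;   /* FP operation 2     */
        unsigned int_op1  : 24;   /* integer op 1       */
        unsigned int_op2  : 24;   /* integer op 2       */
        unsigned branch   : 24;   /* branch             */
    };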

24 Loop Unrolling in VLIW

Clock | Memory reference 1 | Memory reference 2 | FP operation 1    | FP operation 2    | Int. op / branch
  1   | LD F0,0(R1)        | LD F6,-8(R1)       |                   |                   |
  2   | LD F10,-16(R1)     | LD F14,-24(R1)     |                   |                   |
  3   | LD F18,-32(R1)     | LD F22,-40(R1)     | ADDD F4,F0,F2     | ADDD F8,F6,F2     |
  4   | LD F26,-48(R1)     |                    | ADDD F12,F10,F2   | ADDD F16,F14,F2   |
  5   |                    |                    | ADDD F20,F18,F2   | ADDD F24,F22,F2   |
  6   | SD 0(R1),F4        | SD -8(R1),F8       | ADDD F28,F26,F2   |                   |
  7   | SD -16(R1),F12     | SD -24(R1),F16     |                   |                   |
  8   | SD -32(R1),F20     | SD -40(R1),F24     |                   |                   | SUBI R1,R1,#48
  9   | SD -0(R1),F28      |                    |                   |                   | BNEZ R1,LOOP

 Unrolled 7 times to avoid delays
 7 results in 9 clocks, or 1.3 clocks per iteration (down from 6)
 Need more registers in VLIW

25 Trace Scheduling (briefly)
 Exploits parallelism across IF branches, not just LOOP branches
 Two steps:
   Trace selection: find a likely sequence of basic blocks (a trace), forming a (statically predicted) long sequence of straight-line code
   Trace compaction: squeeze the trace into a few VLIW instructions
 Needs bookkeeping code in case the prediction is wrong

26 Trace Scheduling