CIS 662 – Computer Architecture – Fall 2004 – Class 16 – 11/09/04
Compiler Techniques for ILP


Slide 1: Compiler Techniques for ILP

 So far we have explored dynamic hardware techniques for ILP exploitation:
   BTB and branch prediction
   Dynamic scheduling
     Scoreboard
     Tomasulo’s algorithm
   Speculation
   Multiple issue
 How can compilers help?

Slide 2: Loop Unrolling

 Let’s look at the code:

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

          ADD     R2, R0, R0
    Loop: L.D     F0, 0(R1)
          ADD.D   F4, F0, F2
          S.D     F4, 0(R1)
          DADDUI  R1, R1, #-8
          BNE     R1, R2, Loop
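The MIPS code above implements a simple scalar update loop. As a minimal C sketch (0-based indices here instead of the slide's 1..1000; the countdown mirrors the DADDUI decrement):

```c
#include <stddef.h>

/* x[i] = x[i] + s for every element, counting down like the MIPS loop
   (DADDUI R1, R1, #-8 walks the array from the top). */
void add_scalar(double *x, size_t n, double s) {
    for (size_t i = n; i > 0; i--)   /* i runs n..1; element index is i-1 */
        x[i - 1] = x[i - 1] + s;
}
```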

Slide 3: Scheduling on a Simple 5-Stage MIPS Pipeline

    Loop: L.D     F0, 0(R1)
          stall                     ; wait for F0 value to propagate
          ADD.D   F4, F0, F2
          stall                     ; wait for FP add to be completed
          stall
          S.D     F4, 0(R1)
          DADDUI  R1, R1, #-8
          stall                     ; wait for R1 value to propagate
          BNE     R1, R2, Loop
          stall                     ; one cycle branch penalty

    10 cycles per iteration

Slide 4: We Could Rearrange the Instructions

 Interleave the stalled instructions with some independent instructions
 The best we can achieve is 6 cycles:

    Loop: L.D     F0, 0(R1)
          DADDUI  R1, R1, #-8
          ADD.D   F4, F0, F2
          stall
          BNE     R1, R2, Loop
          S.D     F4, 8(R1)         ; in the delay slot; offset adjusted for DADDUI

Slide 5: Loop Unrolling

 Getting more useful instructions into the loop and reducing overhead
 Step 1: Put several iterations together (assume each branch is taken)

    Loop: L.D     F0, 0(R1)
          ADD.D   F4, F0, F2
          S.D     F4, 0(R1)
          DADDUI  R1, R1, #-8
          BNE     R1, R2, Loop
          L.D     F0, 0(R1)
          ADD.D   F4, F0, F2
          S.D     F4, 0(R1)
          DADDUI  R1, R1, #-8
          BNE     R1, R2, Loop
          L.D     F0, 0(R1)
          ADD.D   F4, F0, F2
          S.D     F4, 0(R1)
          DADDUI  R1, R1, #-8
          BNE     R1, R2, Loop
          L.D     F0, 0(R1)
          ADD.D   F4, F0, F2
          S.D     F4, 0(R1)
          DADDUI  R1, R1, #-8
          BNE     R1, R2, Loop

Slide 6: Loop Unrolling

 Step 2: Take out the intermediate control instructions and adjust the offsets

    Loop: L.D     F0, 0(R1)
          ADD.D   F4, F0, F2
          S.D     F4, 0(R1)
          L.D     F0, -8(R1)
          ADD.D   F4, F0, F2
          S.D     F4, -8(R1)
          L.D     F0, -16(R1)
          ADD.D   F4, F0, F2
          S.D     F4, -16(R1)
          L.D     F0, -24(R1)
          ADD.D   F4, F0, F2
          S.D     F4, -24(R1)
          DADDUI  R1, R1, #-32
          BNE     R1, R2, Loop

Slide 7: Loop Unrolling

 Step 3: Rename registers to remove name dependences between copies

    Loop: L.D     F0, 0(R1)
          ADD.D   F4, F0, F2
          S.D     F4, 0(R1)
          L.D     F6, -8(R1)
          ADD.D   F8, F6, F2
          S.D     F8, -8(R1)
          L.D     F10, -16(R1)
          ADD.D   F12, F10, F2
          S.D     F12, -16(R1)
          L.D     F14, -24(R1)
          ADD.D   F16, F14, F2
          S.D     F16, -24(R1)
          DADDUI  R1, R1, #-32
          BNE     R1, R2, Loop

Slide 8: Loop Unrolling

 The current loop still has stalls due to RAW dependences: each ADD.D must wait one cycle for its L.D, and each S.D must wait two cycles for its ADD.D
 Each of the four copies therefore takes 6 cycles (L.D, stall, ADD.D, stall, stall, S.D), plus DADDUI, stall, BNE and the branch penalty:

    28 cycles = 7 cycles per iteration

Slide 9: Loop Unrolling

 Step 4: Interleave the iterations (loads first, then adds, then stores)

    Loop: L.D     F0, 0(R1)
          L.D     F6, -8(R1)
          L.D     F10, -16(R1)
          L.D     F14, -24(R1)
          ADD.D   F4, F0, F2
          ADD.D   F8, F6, F2
          ADD.D   F12, F10, F2
          ADD.D   F16, F14, F2
          S.D     F4, 0(R1)
          S.D     F8, -8(R1)
          DADDUI  R1, R1, #-32
          S.D     F12, 16(R1)       ; offset adjusted: issued after DADDUI
          BNE     R1, R2, Loop
          S.D     F16, 8(R1)        ; in the delay slot; offset adjusted

    14 cycles = 3.5 cycles per iteration
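The same transformation can be sketched in C: four element updates share one copy of the loop overhead, with a cleanup loop for trip counts not divisible by 4 (the slide's 1000 iterations divide evenly, so the assembly needs no cleanup).

```c
#include <stddef.h>

/* x[i] += s, unrolled four ways: one index update and one branch
   per four element updates. */
void add_scalar_unrolled(double *x, size_t n, double s) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        x[i]     += s;
        x[i + 1] += s;
        x[i + 2] += s;
        x[i + 3] += s;
    }
    for (; i < n; i++)   /* cleanup for the last n % 4 elements */
        x[i] += s;
}
```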

Slide 10: Loop Unrolling + Multiple Issue

 Let’s unroll the loop 5 times and mark integer and FP operations

    Loop: L.D     F0, 0(R1)
          ADD.D   F4, F0, F2
          S.D     F4, 0(R1)
          L.D     F6, -8(R1)
          ADD.D   F8, F6, F2
          S.D     F8, -8(R1)
          L.D     F10, -16(R1)
          ADD.D   F12, F10, F2
          S.D     F12, -16(R1)
          L.D     F14, -24(R1)
          ADD.D   F16, F14, F2
          S.D     F16, -24(R1)
          L.D     F18, -32(R1)
          ADD.D   F20, F18, F2
          S.D     F20, -32(R1)
          DADDUI  R1, R1, #-40
          BNE     R1, R2, Loop

Slide 11: Loop Unrolling + Multiple Issue

 Move all loads first, then the ADD.Ds, then the S.Ds

    Loop: L.D     F0, 0(R1)
          L.D     F6, -8(R1)
          L.D     F10, -16(R1)
          L.D     F14, -24(R1)
          L.D     F18, -32(R1)
          ADD.D   F4, F0, F2
          ADD.D   F8, F6, F2
          ADD.D   F12, F10, F2
          ADD.D   F16, F14, F2
          ADD.D   F20, F18, F2
          S.D     F4, 0(R1)
          S.D     F8, -8(R1)
          S.D     F12, -16(R1)
          S.D     F16, -24(R1)
          S.D     F20, -32(R1)
          DADDUI  R1, R1, #-40
          BNE     R1, R2, Loop

Slide 12: Loop Unrolling + Multiple Issue

 Rearrange instructions to cover the delays of DADDUI and BNE

    Loop: L.D     F0, 0(R1)
          L.D     F6, -8(R1)
          L.D     F10, -16(R1)
          L.D     F14, -24(R1)
          L.D     F18, -32(R1)
          ADD.D   F4, F0, F2
          ADD.D   F8, F6, F2
          ADD.D   F12, F10, F2
          ADD.D   F16, F14, F2
          ADD.D   F20, F18, F2
          S.D     F4, 0(R1)
          S.D     F8, -8(R1)
          S.D     F12, -16(R1)
          DADDUI  R1, R1, #-40
          S.D     F16, -24(R1)
          BNE     R1, R2, Loop
          S.D     F20, -32(R1)

Slide 13: Loop Unrolling + Multiple Issue

 Fix the immediate displacement values of the stores moved past DADDUI

    Loop: L.D     F0, 0(R1)
          L.D     F6, -8(R1)
          L.D     F10, -16(R1)
          L.D     F14, -24(R1)
          L.D     F18, -32(R1)
          ADD.D   F4, F0, F2
          ADD.D   F8, F6, F2
          ADD.D   F12, F10, F2
          ADD.D   F16, F14, F2
          ADD.D   F20, F18, F2
          S.D     F4, 0(R1)
          S.D     F8, -8(R1)
          S.D     F12, -16(R1)
          DADDUI  R1, R1, #-40
          S.D     F16, 16(R1)
          BNE     R1, R2, Loop
          S.D     F20, 8(R1)

Slide 14: Loop Unrolling + Multiple Issue

 Now imagine we can issue 2 instructions per cycle: one integer and one FP

    Loop: L.D     F0, 0(R1)
          L.D     F6, -8(R1)
          L.D     F10, -16(R1)
          L.D     F14, -24(R1)
          L.D     F18, -32(R1)
          ADD.D   F4, F0, F2
          ADD.D   F8, F6, F2
          ADD.D   F12, F10, F2
          ADD.D   F16, F14, F2
          ADD.D   F20, F18, F2
          S.D     F4, 0(R1)
          S.D     F8, -8(R1)
          S.D     F12, -16(R1)
          DADDUI  R1, R1, #-40
          S.D     F16, 16(R1)
          BNE     R1, R2, Loop
          S.D     F20, 8(R1)

    12 cycles = 2.4 cycles per iteration

Slide 15: Static Branch Prediction

 Analyze the code and figure out which outcome of a branch is likely:
   Always predict taken
   Predict backward branches as taken, forward as not taken
   Predict based on the profile of previous runs
 Static branch prediction can also help us schedule delayed branch slots

Slide 16: Static Multiple Issue: VLIW

 Hardware checking for dependences within issue packets may be expensive and complex
 The compiler can instead examine the instructions, decide which can be scheduled in parallel, and group them into instruction packets: a VLIW
 The hardware can then be simplified
 The processor has multiple functional units, and each field of the VLIW is assigned to one unit
 For example, a VLIW could contain 5 fields: one must hold an ALU instruction or branch, two must hold FP instructions, and two must hold memory references

Slide 17: Example

 Assume the VLIW contains 5 fields: one ALU instruction or branch, two FP instructions, and two memory references
 Ignore the branch delay slot
 We start from the stalled single-issue loop:

    Loop: L.D     F0, 0(R1)
          stall                     ; wait for F0 value to propagate
          ADD.D   F4, F0, F2
          stall                     ; wait for FP add to be completed
          S.D     F4, 0(R1)
          DADDUI  R1, R1, #-8
          stall                     ; wait for R1 value to propagate
          BNE     R1, R2, Loop
          stall                     ; branch penalty

Slide 18: Example

 Unroll seven times and rearrange (loads first, then ADD.Ds, then S.Ds; stores issued after DADDUI get adjusted offsets)

    Loop: L.D     F0, 0(R1)
          L.D     F6, -8(R1)
          L.D     F10, -16(R1)
          L.D     F14, -24(R1)
          L.D     F18, -32(R1)
          L.D     F22, -40(R1)
          L.D     F26, -48(R1)
          ADD.D   F4, F0, F2
          ADD.D   F8, F6, F2
          ADD.D   F12, F10, F2
          ADD.D   F16, F14, F2
          ADD.D   F20, F18, F2
          ADD.D   F24, F22, F2
          ADD.D   F28, F26, F2
          S.D     F4, 0(R1)
          S.D     F8, -8(R1)
          S.D     F12, -16(R1)
          S.D     F16, -24(R1)
          S.D     F20, 24(R1)
          DADDUI  R1, R1, #-56
          S.D     F24, 16(R1)
          BNE     R1, R2, Loop
          S.D     F28, 8(R1)

Slides 19–26: Example

 Filling the VLIW schedule cycle by cycle gives:

    Cycle | Memory ref 1      | Memory ref 2      | FP op 1             | FP op 2             | ALU / branch
      1   | L.D F0, 0(R1)     | L.D F6, -8(R1)    |                     |                     |
      2   | L.D F10, -16(R1)  | L.D F14, -24(R1)  |                     |                     |
      3   | L.D F18, -32(R1)  | L.D F22, -40(R1)  | ADD.D F4, F0, F2    | ADD.D F8, F6, F2    |
      4   | L.D F26, -48(R1)  |                   | ADD.D F12, F10, F2  | ADD.D F16, F14, F2  |
      5   |                   |                   | ADD.D F20, F18, F2  | ADD.D F24, F22, F2  |
      6   | S.D F4, 0(R1)     | S.D F8, -8(R1)    | ADD.D F28, F26, F2  |                     |
      7   | S.D F12, -16(R1)  | S.D F16, -24(R1)  |                     |                     | DADDUI R1, R1, #-56
      8   | S.D F20, 24(R1)   | S.D F24, 16(R1)   |                     |                     |
      9   | S.D F28, 8(R1)    |                   |                     |                     | BNE R1, R2, Loop

 Overall: 9 cycles for 7 iterations, i.e. 1.29 cycles per iteration
 But the VLIW was always about half-full

Slide 27: Detecting and Enhancing Loop-Level Parallelism

 Determine whether data in later iterations depends on data from earlier iterations: a loop-carried dependence
 Easier to detect at the source-code level than at the machine-code level

    for (i=1; i<=100; i=i+1) {
        A[i+1] = A[i] + C[i];     /* S1 */
        B[i+1] = B[i] + A[i+1];   /* S2 */
    }

 S1 calculates A[i+1], which will be used in the next iteration of S1; S2 calculates B[i+1], which will be used in the next iteration of S2
   These are loop-carried dependences and prevent parallelism
 S1 calculates A[i+1], which is used in the current iteration of S2
   This is a dependence within the loop, not a loop-carried one

Slide 28: Detecting and Enhancing Loop-Level Parallelism

    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];       /* S1 */
        B[i+1] = C[i] + D[i];     /* S2 */
    }

 S1 calculates A[i], which is not used in later iterations
 S2 calculates B[i+1], which will be used by S1 in the next iteration
   This is a loop-carried dependence, but S1 depends on S2, not on itself, and S2 does not depend on S1
 This loop can be made parallel if we transform it so that there is no loop-carried dependence:

    A[1] = A[1] + B[1];
    for (i=1; i<=99; i=i+1) {
        B[i+1] = C[i] + D[i];       /* S2 */
        A[i+1] = A[i+1] + B[i+1];   /* S1 */
    }
    B[101] = C[100] + D[100];
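The transformation above is easy to check by running both versions on the same data; a small C sketch (array sizes and initial values are chosen here just for the check):

```c
#include <string.h>

/* Original loop: S1 reads B[i], which S2 wrote in the previous
   iteration, so the dependence is carried by the loop. */
static void original(double A[], double B[], const double C[], const double D[]) {
    for (int i = 1; i <= 100; i++) {
        A[i] = A[i] + B[i];       /* S1 */
        B[i + 1] = C[i] + D[i];   /* S2 */
    }
}

/* Transformed loop: S1 now consumes the B[i+1] computed in the same
   iteration, so no dependence crosses iterations. */
static void transformed(double A[], double B[], const double C[], const double D[]) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= 99; i++) {
        B[i + 1] = C[i] + D[i];         /* S2 */
        A[i + 1] = A[i + 1] + B[i + 1]; /* S1 */
    }
    B[101] = C[100] + D[100];
}

/* Returns 1 if both versions leave A and B identical (indices 1..101). */
int versions_agree(void) {
    double A1[102], B1[102], A2[102], B2[102], C[102], D[102];
    for (int i = 0; i < 102; i++) {
        A1[i] = A2[i] = (double)i;
        B1[i] = B2[i] = (double)(2 * i);
        C[i] = (double)(i + 1);
        D[i] = (double)(3 * i);
    }
    original(A1, B1, C, D);
    transformed(A2, B2, C, D);
    return memcmp(A1, A2, sizeof A1) == 0 && memcmp(B1, B2, sizeof B1) == 0;
}
```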

Slide 29: Detecting and Enhancing Loop-Level Parallelism

 Recurrences create loop-carried dependences
 But a loop may still be parallelizable if the distance between dependent elements is greater than 1

    for (i=1; i<=100; i=i+1) {
        A[i] = A[i-1] + B[i];     /* distance 1: fully serial */
    }

    for (i=1; i<=100; i=i+1) {
        A[i] = A[i-5] + B[i];     /* distance 5: partially parallel */
    }
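Why a distance-5 recurrence is partially parallel: elements whose indices agree mod 5 form five independent chains, so the chains can run in any order (or in parallel). A hedged C sketch, 0-based here:

```c
#include <string.h>

/* A[i] = A[i-5] + B[i], computed in the usual serial order. */
static void recurrence_serial(double *A, const double *B, int n) {
    for (int i = 5; i < n; i++)
        A[i] = A[i - 5] + B[i];
}

/* The same computation, one chain per residue class mod 5; the chains
   touch disjoint elements, so they could equally run in parallel. */
static void recurrence_by_chains(double *A, const double *B, int n) {
    for (int c = 0; c < 5; c++)
        for (int i = 5 + c; i < n; i += 5)
            A[i] = A[i - 5] + B[i];
}

/* Returns 1 if both orders produce identical results. */
int chains_agree(void) {
    double A1[40], A2[40], B[40];
    for (int i = 0; i < 40; i++) {
        A1[i] = A2[i] = (double)i;
        B[i] = (double)(i * i);
    }
    recurrence_serial(A1, B, 40);
    recurrence_by_chains(A2, B, 40);
    return memcmp(A1, A2, sizeof A1) == 0;
}
```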

Slide 30: Detecting and Enhancing Loop-Level Parallelism

 Find all dependences in the following loop and eliminate as many as you can:

    for (i=1; i<=100; i=i+1) {
        Y[i] = X[i] / c;     /* S1 */
        X[i] = X[i] + c;     /* S2 */
        Z[i] = Y[i] + c;     /* S3 */
        Y[i] = c - Y[i];     /* S4 */
    }

 Solution at page 325
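One standard way to attack this exercise (a sketch of the renaming idea, not the textbook's full page-325 discussion) is to rename the first write to Y into a scratch array — called T here, introduced purely for illustration. This removes the antidependence between S3 and S4 and the output dependence between S1 and S4, leaving only the true dependences S1→S3 and S1→S4:

```c
static void original_loop(double *X, double *Y, double *Z, int n, double c) {
    for (int i = 0; i < n; i++) {
        Y[i] = X[i] / c;   /* S1 */
        X[i] = X[i] + c;   /* S2 */
        Z[i] = Y[i] + c;   /* S3 */
        Y[i] = c - Y[i];   /* S4 */
    }
}

/* T renames the value S1 produces; S3 and S4 read T, and only S4's
   final value is stored into Y. */
static void renamed_loop(double *X, double *Y, double *Z, double *T,
                         int n, double c) {
    for (int i = 0; i < n; i++) {
        T[i] = X[i] / c;
        X[i] = X[i] + c;
        Z[i] = T[i] + c;
        Y[i] = c - T[i];
    }
}

/* Returns 1 if renaming leaves X, Y and Z unchanged. */
int renaming_agrees(void) {
    double X1[8], X2[8], Y1[8], Y2[8], Z1[8], Z2[8], T[8];
    for (int i = 0; i < 8; i++) X1[i] = X2[i] = (double)(4 * i);
    original_loop(X1, Y1, Z1, 8, 2.0);
    renamed_loop(X2, Y2, Z2, T, 8, 2.0);
    for (int i = 0; i < 8; i++)
        if (X1[i] != X2[i] || Y1[i] != Y2[i] || Z1[i] != Z2[i]) return 0;
    return 1;
}
```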

Slide 31: Code Transformation

 Eliminating dependent computations
 Copy propagation:

    DADDUI R1, R2, #4
    DADDUI R1, R1, #4       →    DADDUI R1, R2, #8

 Tree height reduction:

    ADD R1, R2, R3               ADD R1, R2, R3
    ADD R4, R1, R6          →    ADD R4, R6, R7
    ADD R8, R4, R7               ADD R8, R1, R4

    (left: must be done sequentially; right: the first two ADDs can be done in parallel)

    sum = sum + x   /* suppose this is in a loop and we unroll it 5 times */
    sum = sum + x1 + x2 + x3 + x4 + x5         /* must be done sequentially  */
    sum = (sum + x1) + (x2 + x3) + (x4 + x5)   /* pairs can be done in parallel */
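The reassociation trick is easiest to check with integers, where addition is exactly associative (for floating point, reassociation can change rounding, which is why compilers only do it under relaxed FP modes). A small sketch:

```c
/* Left-to-right chain: five dependent adds, dependence depth 5. */
int sum_chain(int sum, int x1, int x2, int x3, int x4, int x5) {
    return ((((sum + x1) + x2) + x3) + x4) + x5;
}

/* Reassociated: (sum+x1), (x2+x3) and (x4+x5) are independent, so the
   dependence tree is only 3 levels deep instead of 5. */
int sum_tree(int sum, int x1, int x2, int x3, int x4, int x5) {
    return ((sum + x1) + (x2 + x3)) + (x4 + x5);
}
```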

Slide 32: Software Pipelining

 Combine instructions from different loop iterations to separate dependent instructions within an iteration

Slide 33: Software Pipelining

 Apply the software pipelining technique to the following loop:

    Loop: L.D     F0, 0(R1)
          ADD.D   F4, F0, F2
          S.D     F4, 0(R1)
          DADDUI  R1, R1, #-8
          BNE     R1, R2, Loop

 Overlapping three consecutive iterations (at R1+16, R1+8 and R1) gives the pipelined kernel:

    Loop: S.D     F4, 16(R1)    ; store for iteration i
          ADD.D   F4, F0, F2    ; add for iteration i+1
          L.D     F0, 0(R1)     ; load for iteration i+2
          DADDUI  R1, R1, #-8
          BNE     R1, R2, Loop

 Startup code is needed before the kernel, and cleanup code after it
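The same idea can be sketched in scalar C (a hedged illustration, not the assembly kernel itself): in the steady state each pass stores the result of iteration i, adds for iteration i+1, and loads for iteration i+2, so no statement in the body waits on a value produced just before it.

```c
/* Software-pipelined x[i] += s: the loop body mixes three iterations,
   with explicit startup and cleanup code around the kernel. */
void add_scalar_pipelined(double *x, int n, double s) {
    if (n < 3) {                  /* too short to pipeline: do it plainly */
        for (int i = 0; i < n; i++) x[i] += s;
        return;
    }
    double loaded = x[0];         /* startup: load for iteration 0 */
    double added  = loaded + s;   /* startup: add for iteration 0  */
    loaded = x[1];                /* startup: load for iteration 1 */
    for (int i = 0; i < n - 2; i++) {
        x[i] = added;             /* store for iteration i   */
        added = loaded + s;       /* add for iteration i+1   */
        loaded = x[i + 2];        /* load for iteration i+2  */
    }
    x[n - 2] = added;             /* cleanup: store iteration n-2        */
    x[n - 1] = loaded + s;        /* cleanup: add and store iteration n-1 */
}
```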

Slide 34: Software Pipelining vs. Loop Unrolling

 Loop unrolling eliminates loop-maintenance overhead, exposing parallelism between iterations
   But it creates larger code
 Software pipelining enables loop iterations to run at top speed by eliminating the RAW hazards that create latencies within an iteration
   But it requires more complex transformations

Slide 35: Homework #8

 Due Tuesday, November 16, by the end of the class
 Submit either in class (paper) or by e-mail (PS or PDF only), or bring a paper copy to my office
 Do exercises 4.2, 4.6, 4.9 (skip parts d. and e.), and 4.11