Fall 2002 Lecture 14: Instruction Scheduling. Saman Amarasinghe 6.035 ©MIT Fall 1998


Saman Amarasinghe ©MIT Fall 1998 Outline
–Modern architectures
–Branch delay slots
–Introduction to instruction scheduling
–List scheduling
–Resource constraints
–Interaction with register allocation
–Scheduling across basic blocks
–Trace scheduling
–Scheduling for loops
–Loop unrolling
–Software pipelining

Saman Amarasinghe ©MIT Fall 1998 Simple Machine Model
Instructions are executed in sequence
–Fetch, decode, execute, store results
–One instruction at a time
For branch instructions, start fetching from a different location if needed
–Check branch condition
–Next instruction may come from a new location given by the branch instruction

Saman Amarasinghe ©MIT Fall 1998 Simple Execution Model
5-stage pipeline
Fetch: get the next instruction
Decode: figure out what that instruction is
Execute: perform the ALU operation
–address calculation in a memory op
Memory: do the memory access in a memory op
Write Back: write the results back
fetch → decode → execute → memory → writeback

Saman Amarasinghe ©MIT Fall 1998 Simple Execution Model
[pipeline diagram: Inst 1 through Inst 5 each pass through IF DE EXE MEM WB; each instruction starts one cycle after the previous one, so the stages overlap in time]
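The overlap in the diagram can be captured with a tiny timing model. This is an illustrative Python sketch (not from the original slides): with no hazards, a filled 5-stage pipeline retires one instruction per cycle, so n instructions finish in n + 4 cycles instead of 5n.

```python
# Toy timing model for the 5-stage pipeline above (illustrative, not
# cycle-accurate for any real machine).
STAGES = ["IF", "DE", "EXE", "MEM", "WB"]

def unpipelined_cycles(n_instructions: int) -> int:
    # One instruction at a time: all 5 stages finish before the next starts.
    return len(STAGES) * n_instructions

def pipelined_cycles(n_instructions: int) -> int:
    # One instruction enters the pipe per cycle; the last one needs
    # len(STAGES) - 1 extra cycles to drain.
    return n_instructions + len(STAGES) - 1
```

For the 5-instruction diagram this gives 9 cycles pipelined versus 25 unpipelined.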

Saman Amarasinghe ©MIT Fall 1998 Outline Modern architectures Branch delay slots Introduction to instruction scheduling List scheduling Resource constraints Interaction with register allocation Scheduling across basic blocks Trace scheduling Scheduling for loops Loop unrolling Software pipelining 5

Saman Amarasinghe ©MIT Fall 1998 Handling Cond. Branch Instructions
The pipeline does not know the location of the next instruction until later
–after DE in jump instructions
–after EXE in branch instructions
What to do with the middle 2 instructions?
[pipeline diagram: the branch is followed by two instruction slots marked ??? before the next known instruction]

Saman Amarasinghe ©MIT Fall 1998 Handling Cond. Branch Instructions
What to do with the middle 2 instructions?
Stall the pipeline on a branch until we know the address of the next instruction
–wasted cycles
[pipeline diagram: the next instruction's fetch begins only after the branch resolves]

Saman Amarasinghe ©MIT Fall 1998 Handling Cond. Branch Instructions
What to do with the middle 2 instructions?
Delay the action of the branch
–Make the branch take effect only after two more instructions
–The two instructions after the branch get executed regardless of the branch
[pipeline diagram: branch, then the next sequential instruction, then the branch target instruction]

Saman Amarasinghe ©MIT Fall 1998 Branch Delay Slot(s)
MIPS has a branch delay slot
–The instruction after the branch is executed whether or not the branch is taken
–Fetching from the branch target takes place only after that
ble r3, foo
(branch delay slot)
What to put in the branch delay slot?

Saman Amarasinghe ©MIT Fall 1998 Filling The Branch Delay Slot
Simple solution: put a no-op
–Wasted instruction, just like a stall
ble r3, foo
noop (branch delay slot)

Saman Amarasinghe ©MIT Fall 1998 Filling The Branch Delay Slot
Move an instruction from above the branch
–The moved instruction executes iff the branch executes
–Get the instruction from the same basic block as the branch
–Don't move a branch instruction!
–The instruction must be safe to move over the branch: the branch cannot depend on its result
Before: prev_instr ; ble r3, foo
After: ble r3, foo ; prev_instr (in the delay slot)

Saman Amarasinghe ©MIT Fall 1998 Filling The Branch Delay Slot
Move an instruction dominated by the branch instruction
ble r3, foo
dom_instr (in the delay slot)
foo: dom_instr

Saman Amarasinghe ©MIT Fall 1998 Filling The Branch Delay Slot
Move an instruction from the target
–Instruction dominated by the target
–No other ways to reach the target (if there are, take care of them)
–If it is a conditional branch, the instruction should not have a lasting effect when the branch is not taken
ble r3, foo
instr (in the delay slot)
foo: instr
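The simplest of the strategies above, moving an instruction from above the branch, can be sketched in a few lines. This is an illustrative Python model, not a real compiler pass: each instruction is a toy tuple (text, destination register or None, set of source registers), and only the instruction immediately before the branch is considered.

```python
# Sketch of delay-slot filling from above the branch (toy IR, assumed
# names). Only the instruction directly above the branch is a candidate;
# it must not itself be a branch, and the branch must not read its result.
NOOP = ("noop", None, set())

def fill_delay_slot(block):
    """block: list of (text, dest, srcs); the last entry is the branch.
    Returns a new block with the delay slot filled (or a noop appended)."""
    branch = block[-1]
    if len(block) >= 2:
        ins = block[-2]
        text, dest, srcs = ins
        movable = not text.startswith(("ble", "goto"))   # never move a branch
        safe = dest is None or dest not in branch[2]      # branch must not read it
        if movable and safe:
            return block[:-2] + [branch, ins]
    return block + [NOOP]
```

On a block ending `r5 = r2 - 1; ble r3, foo` the subtraction moves into the slot; if the preceding instruction computes r3, the slot gets a noop instead.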

Saman Amarasinghe ©MIT Fall 1998 Load Delay Slots
The result of a load is not available until the end of the MEM stage
If the value of the load is used immediately, what to do?
[pipeline diagram: the load is followed directly by an instruction that uses the loaded value]

Saman Amarasinghe ©MIT Fall 1998 Load Delay Slots
If the value of the load is used immediately, what to do?
Always stall one cycle
Stall one cycle only if the next instruction uses the value
–Needs hardware to detect this
Have a delay slot for the load
–The new value is only available after two instructions
–If the next instruction reads the register, it gets the old value
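With a load delay slot and no interlock, the compiler must keep a dependent instruction out of the slot. A minimal Python sketch (toy IR of (text, dest, srcs) tuples, illustrative names): insert a noop whenever the instruction right after a load reads the loaded register.

```python
# Sketch: compiler-side handling of a 1-cycle load delay slot with no
# hardware interlock. Inserts a noop only when the very next instruction
# reads the just-loaded register.
NOOP = ("noop", None, set())

def fill_load_delay_slots(block):
    out = []
    for ins in block:
        if out:
            prev_text, prev_dest, _ = out[-1]
            if prev_text.startswith("load") and prev_dest in ins[2]:
                out.append(NOOP)   # dependent use in the delay slot: pad it
        out.append(ins)
    return out
```

Note that a second, independent load can sit in the first load's delay slot for free, which is exactly what the reordered example on the next slides exploits.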

Saman Amarasinghe ©MIT Fall 1998 Example
r2 = *(r1 + 4)
r3 = *(r1 + 8)
r4 = r2 + r3
r5 = r2 - 1
goto L1

Saman Amarasinghe ©MIT Fall 1998 Example
r2 = *(r1 + 4)
noop
r3 = *(r1 + 8)
noop
r4 = r2 + r3
r5 = r2 - 1
goto L1
noop
Assume 1 cycle delay on branches and 1 cycle latency for loads


Saman Amarasinghe ©MIT Fall 1998 Example
r2 = *(r1 + 4)
r3 = *(r1 + 8)
noop
r4 = r2 + r3
r5 = r2 - 1
goto L1
noop
Assume 1 cycle delay on branches and 1 cycle latency for loads


Saman Amarasinghe ©MIT Fall 1998 Example
r2 = *(r1 + 4)
r3 = *(r1 + 8)
r5 = r2 - 1
r4 = r2 + r3
goto L1
noop
Assume 1 cycle delay on branches and 1 cycle latency for loads


Saman Amarasinghe ©MIT Fall 1998 Example
r2 = *(r1 + 4)
r3 = *(r1 + 8)
r5 = r2 - 1
goto L1
r4 = r2 + r3
Assume 1 cycle delay on branches and 1 cycle latency for loads


Saman Amarasinghe ©MIT Fall 1998 Outline
–Modern architectures
–Branch delay slots
–Introduction to instruction scheduling
–List scheduling
–Resource constraints
–Interaction with register allocation
–Scheduling across basic blocks
–Trace scheduling
–Scheduling for loops
–Loop unrolling
–Software pipelining

Saman Amarasinghe ©MIT Fall 1998 From a Simple Machine Model to a Real Machine Model
Many pipeline stages
–MIPS R4000 has 8 stages
Different instructions take different amounts of time to execute
–mult: 10 cycles
–div: 69 cycles
–ddiv: 133 cycles
Hardware stalls the pipeline if an instruction uses a result that is not ready

Saman Amarasinghe ©MIT Fall 1998 Real Machine Model cont.
Most modern processors have multiple execution units (superscalar)
–If the instruction sequence is right, multiple operations will happen in the same cycle
–Even more important to have the right instruction sequence

Saman Amarasinghe ©MIT Fall 1998 Instruction Scheduling Reorder instructions so that pipeline stalls are minimized

Saman Amarasinghe ©MIT Fall 1998 Constraints On Scheduling
–Data dependencies
–Control dependencies
–Resource constraints

Saman Amarasinghe ©MIT Fall 1998 Data Dependency between Instructions
If two instructions access the same variable, they can be dependent
Kinds of dependencies
–True: write → read
–Anti: read → write
–Output: write → write
What to do if two instructions are dependent?
–The order of execution cannot be reversed
–This reduces the possibilities for scheduling
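The three kinds of dependence fall directly out of the read/write sets. An illustrative Python sketch (toy encoding, each instruction as (dest, set of sources)):

```python
# Sketch: classify the dependence between two register instructions,
# earlier one first. Returns any of {"true", "anti", "output"}.
def dependences(earlier, later):
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    kinds = set()
    if e_dest is not None and e_dest in l_srcs:
        kinds.add("true")    # write -> read
    if l_dest is not None and l_dest in e_srcs:
        kinds.add("anti")    # read -> write
    if e_dest is not None and e_dest == l_dest:
        kinds.add("output")  # write -> write
    return kinds
```

For instance, `r2 = *(r1 + 4)` followed by `r4 = r2 + r3` is a true dependence on r2.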

Saman Amarasinghe ©MIT Fall 1998 Computing Dependencies
For basic blocks, compute dependencies by walking through the instructions
Identifying register dependencies is simple
–is it the same register?
For memory accesses
–simple: base + offset1 ?= base + offset2
–data dependence analysis: a[2i] ?= a[2i+1]
–interprocedural analysis: global ?= parameter
–pointer alias analysis: p1 ?= p2
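The "simple" base + offset test above is the only case where two memory accesses can be proven distinct without deeper analysis; everything else must conservatively be treated as possibly aliasing. A minimal sketch of that rule:

```python
# Sketch of the simplest memory-disambiguation test: two accesses of the
# form base + offset are provably distinct only when the base register is
# the same and the offsets differ. Anything else: assume "may alias".
def may_alias(base1, off1, base2, off2):
    if base1 == base2:
        return off1 == off2   # same base: compare constant offsets exactly
    return True               # different bases: unknown, be conservative
```

So `*(r1 + 4)` and `*(r1 + 8)` are independent, but `*(r1 + 4)` and `*(r2 + 4)` must be ordered unless alias analysis says otherwise.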

Saman Amarasinghe ©MIT Fall 1998 Representing Dependencies
Using a dependence DAG, one per basic block
Nodes are instructions, edges represent dependencies
1: r2 = *(r1 + 4)
2: r3 = *(r1 + 8)
3: r4 = r2 + r3
4: r5 = r2 - 1
[DAG: edges 1→3, 2→3, 1→4]
Edge is labeled with latency:
–v(i → j) = delay required between the initiation times of i and j minus the execution time required by i
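Building the dependence DAG for a basic block is a single forward walk, as the previous slide says. An illustrative Python sketch (toy IR of (dest, sources, latency) tuples; memory dependences are ignored for brevity, and anti/output edges get latency 0):

```python
# Sketch: build a dependence DAG for one basic block.
# edges maps (i, j) -> latency of the edge from instruction i to j.
def build_dag(block):
    edges = {}
    last_def = {}   # register -> index of most recent writer
    uses = {}       # register -> readers since that write
    for j, (dest, srcs, lat) in enumerate(block):
        for r in srcs:
            if r in last_def:                       # true dependence
                edges[(last_def[r], j)] = block[last_def[r]][2]
            uses.setdefault(r, []).append(j)
        if dest is not None:
            if dest in last_def:                    # output dependence
                edges.setdefault((last_def[dest], j), 0)
            for i in uses.get(dest, []):            # anti dependences
                if i != j:
                    edges.setdefault((i, j), 0)
            last_def[dest] = j
            uses[dest] = []
    return edges
```

On the 4-instruction example above (0-indexed, loads given latency 2 for the delay slot) this yields exactly the edges 1→3, 2→3, 1→4 of the slide's DAG.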

Saman Amarasinghe ©MIT Fall 1998 Example
1: r2 = *(r1 + 4)
2: r3 = *(r2 + 4)
3: r4 = r2 + r3
4: r5 = r2 - 1

Saman Amarasinghe ©MIT Fall 1998 Another Example
1: r2 = *(r1 + 4)
2: *(r1 + 4) = r3
3: r3 = r2 + r3
4: r5 = r2 - 1


Saman Amarasinghe ©MIT Fall 1998 Control Dependencies and Resource Constraints
For now, let's only worry about basic blocks
For now, let's look at simple pipelines

Saman Amarasinghe ©MIT Fall 1998 Example
1: LA r1, array
2: LD r2, 4(r1)
3: AND r3, r3, 0x00FF
4: MULC r6, r6, 100
5: ST r7, 4(r6)
6: DIVC r5, r5, 100
7: ADD r4, r2, r5
8: MUL r5, r2, r4
9: ST r4, 0(r1)

Saman Amarasinghe ©MIT Fall 1998 Example (results available in)
1: LA r1, array        1 cycle
2: LD r2, 4(r1)        1 cycle
3: AND r3, r3, 0x00FF  1 cycle
4: MULC r6, r6, 100    3 cycles
5: ST r7, 4(r6)
6: DIVC r5, r5, 100    4 cycles
7: ADD r4, r2, r5      1 cycle
8: MUL r5, r2, r4      3 cycles
9: ST r4, 0(r1)

Saman Amarasinghe ©MIT Fall 1998 Example (results available in)
1: LA r1, array        1 cycle
2: LD r2, 4(r1)        1 cycle
3: AND r3, r3, 0x00FF  1 cycle
4: MULC r6, r6, 100    3 cycles
5: ST r7, 4(r6)
6: DIVC r5, r5, 100    4 cycles
7: ADD r4, r2, r5      1 cycle
8: MUL r5, r2, r4      3 cycles
9: ST r4, 0(r1)
Issued in original order, with stalls (st): 1 2 3 4 st st 5 6 st st st 7 8 9
14 cycles!

Saman Amarasinghe ©MIT Fall 1998 Outline
–Modern architectures
–Branch delay slots
–Introduction to instruction scheduling
–List scheduling
–Resource constraints
–Interaction with register allocation
–Scheduling across basic blocks
–Trace scheduling
–Scheduling for loops
–Loop unrolling
–Software pipelining

Saman Amarasinghe ©MIT Fall 1998 List Scheduling Algorithm
Idea
–Do a topological sort of the dependence DAG
–Consider when an instruction can be scheduled without causing a stall
–Schedule the instruction if it causes no stall and all its predecessors are already scheduled
Optimal list scheduling is NP-complete
–Use heuristics when necessary

Saman Amarasinghe ©MIT Fall 1998 List Scheduling Algorithm
Create a dependence DAG of a basic block
Topological sort
READY = nodes with no predecessors
Loop until READY is empty:
–Schedule each node in READY when it causes no stall
–Update READY
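The loop above can be sketched concretely. This is an illustrative Python implementation, not the lecture's code: nodes are DAG instructions, edges carry latencies, one instruction issues per cycle on a single-issue machine, and a cycle is "stalled" when no READY node's operands are available yet.

```python
# Sketch of greedy list scheduling for a single-issue pipeline.
# edges: {(i, j): latency}; the graph is assumed to be a DAG.
# priority: node -> key; among issuable nodes, the largest key wins.
def list_schedule(n, edges, priority):
    preds = {j: set() for j in range(n)}
    for (i, j) in edges:
        preds[j].add(i)
    issued, order = {}, []
    cycle = 1
    while len(order) < n:
        # READY: not yet issued, all predecessors already issued
        ready = [j for j in range(n)
                 if j not in issued and preds[j].issubset(issued)]
        # issuable without a stall: all operands available this cycle
        can_go = [j for j in ready
                  if all(issued[i] + edges[(i, j)] <= cycle for i in preds[j])]
        if can_go:
            pick = max(can_go, key=lambda j: priority[j])
            issued[pick] = cycle
            order.append(pick)
        cycle += 1          # if nothing could issue, this cycle is a stall
    return order, issued
```

The priority function is exactly where the heuristics of the next slides (longest path to a leaf, number of successors) plug in.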

Saman Amarasinghe ©MIT Fall 1998 Heuristics for Selection
Heuristics for selecting from the READY list
–pick the node with the longest path to a leaf in the dependence graph
–pick a node with the most immediate successors
–pick a node that can go to a less busy pipeline (in a superscalar)

Saman Amarasinghe ©MIT Fall 1998 Heuristics for Selection
Pick the node with the longest path to a leaf in the dependence graph
Algorithm (for node x)
–If x has no successors, d_x = 0
–Otherwise, d_x = max(d_y + c_xy) over all successors y of x
–Visit nodes in reverse breadth-first order
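The d_x recurrence can be written directly; memoized recursion visits leaves first, matching the bottom-up order the slide asks for. A Python sketch (names illustrative):

```python
from functools import lru_cache

# Sketch: d[x] = longest latency-weighted path from x to a leaf.
# edges: {(i, j): c_ij}; the graph is assumed to be a DAG.
def longest_path_to_leaf(n, edges):
    succs = {i: [] for i in range(n)}
    for (i, j), c in edges.items():
        succs[i].append((j, c))

    @lru_cache(maxsize=None)
    def d(x):
        if not succs[x]:
            return 0                                  # leaf: d_x = 0
        return max(d(y) + c for (y, c) in succs[x])   # d_x = max(d_y + c_xy)

    return {x: d(x) for x in range(n)}
```

The resulting dictionary can be passed as the priority to a list scheduler, so long chains of dependent work start as early as possible.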

Saman Amarasinghe ©MIT Fall 1998 Heuristics for Selection
Pick a node with the most immediate successors
Algorithm (for node x):
–f_x = number of successors of x

Saman Amarasinghe ©MIT Fall 1998 Example (results available in)
1: LA r1, array        1 cycle
2: LD r2, 4(r1)        1 cycle
3: AND r3, r3, 0x00FF  1 cycle
4: MULC r6, r6, 100    3 cycles
5: ST r7, 4(r6)
6: DIVC r5, r5, 100    4 cycles
7: ADD r4, r2, r5      1 cycle
8: MUL r5, r2, r4      3 cycles
9: ST r4, 0(r1)

Saman Amarasinghe ©MIT Fall 1998 Example
1: LA r1, array
2: LD r2, 4(r1)
3: AND r3, r3, 0x00FF
4: MULC r6, r6, 100
5: ST r7, 4(r6)
6: DIVC r5, r5, 100
7: ADD r4, r2, r5
8: MUL r5, r2, r4
9: ST r4, 0(r1)


Saman Amarasinghe ©MIT Fall 1998 Example
[dependence DAG of the block above, annotated with the two priorities: d = longest latency-weighted path to a leaf, f = number of immediate successors]
Trace of list scheduling (scheduled order on the right):
READY = { 6, 1, 4, 3 }
schedule 6 → READY = { 1, 4, 3 }                     scheduled: 6
schedule 1 → 2 becomes ready → READY = { 2, 4, 3 }   scheduled: 6 1
schedule 2 → 7 becomes ready → READY = { 7, 4, 3 }   scheduled: 6 1 2
schedule 4 → 5 becomes ready → READY = { 7, 3, 5 }   scheduled: 6 1 2 4
schedule 7 → 8, 9 become ready → READY = { 3, 5, 8, 9 }   scheduled: 6 1 2 4 7
schedule 3 → READY = { 5, 8, 9 }                     scheduled: 6 1 2 4 7 3
schedule 5 → READY = { 8, 9 }                        scheduled: 6 1 2 4 7 3 5
schedule 8 → READY = { 9 }                           scheduled: 6 1 2 4 7 3 5 8
schedule 9 → READY = { }                             scheduled: 6 1 2 4 7 3 5 8 9

Saman Amarasinghe ©MIT Fall 1998 Example (results available in)
1: LA r1, array        1 cycle
2: LD r2, 4(r1)        1 cycle
3: AND r3, r3, 0x00FF  1 cycle
4: MULC r6, r6, 100    3 cycles
5: ST r7, 4(r6)
6: DIVC r5, r5, 100    4 cycles
7: ADD r4, r2, r5      1 cycle
8: MUL r5, r2, r4      3 cycles
9: ST r4, 0(r1)
9 cycles

Saman Amarasinghe ©MIT Fall 1998 Example (results available in)
1: LA r1, array        1 cycle
2: LD r2, 4(r1)        1 cycle
3: AND r3, r3, 0x00FF  1 cycle
4: MULC r6, r6, 100    3 cycles
5: ST r7, 4(r6)
6: DIVC r5, r5, 100    4 cycles
7: ADD r4, r2, r5      1 cycle
8: MUL r5, r2, r4      3 cycles
9: ST r4, 0(r1)
Original order: 1 2 3 4 st st 5 6 st st st 7 8 9 → 14 cycles vs 9 cycles scheduled
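Both cycle counts can be reproduced with a tiny stall model. This is an illustrative Python sketch, not the lecture's simulator: a single-issue machine that stalls until operands are ready, with the latency table above (a load's value is usable 2 cycles after issue, covering its delay slot) and only the true dependences of the block (anti/output dependences are not modeled here).

```python
# Latencies from the slide's table; LD is 2 so a consumer issues 2 cycles later.
LAT = {1: 1, 2: 2, 3: 1, 4: 3, 5: 1, 6: 4, 7: 1, 8: 3, 9: 1}
# True dependences read off the code: e.g. 7 (ADD) needs r2 from 2 and r5 from 6.
DEPS = {2: [1], 5: [4], 7: [2, 6], 8: [2, 7], 9: [1, 7]}

def cycles(order):
    """Cycle in which the last instruction of the given issue order starts."""
    issue = {}
    t = 0
    for i in order:
        # issue no earlier than the next cycle, and only once operands are ready
        t = max([t + 1] + [issue[d] + LAT[d] for d in DEPS.get(i, [])])
        issue[i] = t
    return t
```

Running it on the original order gives 14 cycles, and on the list-scheduled order 6 1 2 4 7 3 5 8 9 gives 9 cycles, matching the slide.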