EECS 583 – Class 11: Instruction Scheduling
University of Michigan, October 12, 2011

- 1 - Reading Material + Announcements
* Today's class
  - "The Importance of Prepass Code Scheduling for Superscalar and Superpipelined Processors," P. Chang et al., IEEE Transactions on Computers, 1995.
* Next class
  - "Sentinel Scheduling for VLIW and Superscalar Processors," S. Mahlke et al., ASPLOS-5, Oct. 1992.
* Reminder: HW 2 – Speculative LICM
  - Due a week from Friday. Get busy; go bug Daya if you are stuck!
* Class project proposals
  - Week of Oct 24: Daya and I will meet with each group to discuss an informal project proposal
  - Signup sheet available next week
  - Think about partners/topic!

- 2 - Homework Problem From Last Time – Answer
Optimize the unrolled loop: renaming, tree height reduction, induction/accumulator expansion.

Original loop:
  loop: r1 = load(r2)
        r5 = r6 + 3
        r6 = r5 + r1
        r2 = r2 + 4
        if (r2 < 400) goto loop

Unrolled 3 times:
  loop: r1 = load(r2)
        r5 = r6 + 3
        r6 = r5 + r1
        r2 = r2 + 4
        r1 = load(r2)
        r5 = r6 + 3
        r6 = r5 + r1
        r2 = r2 + 4
        r1 = load(r2)
        r5 = r6 + 3
        r6 = r5 + r1
        r2 = r2 + 4
        if (r2 < 400) goto loop

After renaming and tree height reduction:
  loop: r1 = load(r2)
        r5 = r1 + 3
        r6 = r6 + r5
        r2 = r2 + 4
        r11 = load(r2)
        r15 = r11 + 3
        r6 = r6 + r15
        r2 = r2 + 4
        r21 = load(r2)
        r25 = r21 + 3
        r6 = r6 + r25
        r2 = r2 + 4
        if (r2 < 400) goto loop

After accumulator and induction expansion (r16 = r26 = 0 before the loop):
  loop: r1 = load(r2)
        r5 = r1 + 3
        r6 = r6 + r5
        r11 = load(r2+4)
        r15 = r11 + 3
        r16 = r16 + r15
        r21 = load(r2+8)
        r25 = r21 + 3
        r26 = r26 + r25
        r2 = r2 + 12
        if (r2 < 400) goto loop
  After the loop: r6 = r6 + r16 ; r6 = r6 + r26

- 3 - From Last Time: Dependences

Register dependences:
  Flow:   r1 = r2 + r3 ; r4 = r1 * 6
  Output: r1 = r2 + r3 ; r1 = r4 * 6
  Anti:   r1 = r2 + r3 ; r2 = r5 * 6

Memory dependences:
  Mem-flow:   store (r1, r2) ; r3 = load(r1)
  Mem-output: store (r1, r2) ; store (r1, r3)
  Mem-anti:   r2 = load(r1) ; store (r1, r3)

Control dependences:
  (C1) if (r1 != 0) ; r2 = load(r1)

- 4 - From Last Time: Dependence Graph
* Represent dependences between operations in a block via a DAG
  - Nodes = operations
  - Edges = dependences
* A single pass over the block (comparing each op against the earlier ops) suffices to insert the dependences
* Example (a construction sketch follows this slide):
  1: r1 = load(r2)
  2: r2 = r1 + r4
  3: store (r4, r2)
  4: p1 = cmpp (r2 < 0)
  5: branch if p1 to BB3
  6: store (r1, r2)
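A minimal sketch of this construction, assuming a simple op record with explicit destination/source lists; the conservative memory and branch rules (no alias analysis, branches as barriers) are illustrative assumptions, not the course's reference implementation.

    from collections import namedtuple

    # kind is one of: "load", "store", "branch", "alu"
    Op = namedtuple("Op", "id dests srcs kind")

    def is_mem(op):
        return op.kind in ("load", "store")

    def build_dag(ops):
        edges = set()  # (pred_id, succ_id, dep_type)
        for i, a in enumerate(ops):
            for b in ops[i + 1:]:
                for d in a.dests:
                    if d in b.srcs:
                        edges.add((a.id, b.id, "flow"))
                    if d in b.dests:
                        edges.add((a.id, b.id, "output"))
                if any(s in b.dests for s in a.srcs):
                    edges.add((a.id, b.id, "anti"))
                # No alias analysis in this sketch: conservatively order any
                # pair of memory ops where at least one is a store.
                if is_mem(a) and is_mem(b) and "store" in (a.kind, b.kind):
                    edges.add((a.id, b.id, "mem"))
                # Conservative control rule: branches act as barriers.
                if "branch" in (a.kind, b.kind):
                    edges.add((a.id, b.id, "control"))
        return edges

    block = [Op(1, ("r1",), ("r2",), "load"),      # 1: r1 = load(r2)
             Op(2, ("r2",), ("r1", "r4"), "alu"),  # 2: r2 = r1 + r4
             Op(3, (), ("r4", "r2"), "store"),     # 3: store (r4, r2)
             Op(4, ("p1",), ("r2",), "alu"),       # 4: p1 = cmpp (r2 < 0)
             Op(5, (), ("p1",), "branch"),         # 5: branch if p1 to BB3
             Op(6, (), ("r1", "r2"), "store")]     # 6: store (r1, r2)
    for edge in sorted(build_dag(block)):
        print(edge)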

- 5 - Dependence Graph Properties – Estart
* Estart = earliest start time (as soon as possible, ASAP)
  - Schedule length with infinite resources (dependence height)
  - Estart = 0 if the node has no predecessors
  - Estart = MAX(Estart(pred) + latency) over all predecessor nodes
  - Example (figure not reproduced)

- 6 - Lstart
* Lstart = latest start time (as late as possible, ALAP)
  - Latest time a node can be scheduled such that the schedule length does not grow beyond the infinite-resource schedule length
  - Lstart = Estart if the node has no successors
  - Lstart = MIN(Lstart(succ) - latency) over all successor nodes
  - Example (figure not reproduced)

- 7 - Slack
* Slack = a measure of scheduling freedom
  - Slack = Lstart - Estart for each node
  - Larger slack means more mobility
  - Example (figure not reproduced; a sketch computing Estart/Lstart/slack follows this slide)
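A minimal sketch of the Estart/Lstart/slack computation on a dependence DAG, assuming the graph is given as a successor map with per-edge latencies; the graph at the bottom is illustrative, not the slides' example.

    def schedule_props(succs):
        # succs: {node: [(successor, latency), ...]}
        nodes = list(succs)
        preds = {n: [] for n in nodes}
        for n, out in succs.items():
            for s, lat in out:
                preds[s].append((n, lat))

        # Topological order (Kahn's algorithm).
        indeg = {n: len(preds[n]) for n in nodes}
        order, ready = [], [n for n in nodes if indeg[n] == 0]
        while ready:
            n = ready.pop()
            order.append(n)
            for s, _ in succs[n]:
                indeg[s] -= 1
                if indeg[s] == 0:
                    ready.append(s)

        estart = {n: 0 for n in nodes}
        for n in order:                       # forward pass: ASAP
            for p, lat in preds[n]:
                estart[n] = max(estart[n], estart[p] + lat)

        lstart = {}
        for n in reversed(order):             # backward pass: ALAP
            if not succs[n]:
                lstart[n] = estart[n]         # sink: Lstart = Estart
            else:
                lstart[n] = min(lstart[s] - lat for s, lat in succs[n])

        slack = {n: lstart[n] - estart[n] for n in nodes}
        return estart, lstart, slack

    g = {1: [(2, 2), (6, 1)], 2: [(3, 1), (4, 1)],
         3: [(5, 1)], 4: [(5, 1)], 5: [], 6: [(5, 1)]}
    e, l, s = schedule_props(g)
    print(e, l, s)   # node 6 has slack 2; the rest lie on the critical path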

- 8 - Critical Path
* Critical operations = operations with slack = 0
  - No mobility: they cannot be delayed without extending the schedule length of the block
  - Critical path = a sequence of critical operations from a node with no predecessors to an exit node; there can be multiple critical paths

- 9 - Class Problem
For each node in the dependence graph (figure not reproduced), fill in:
  Node | Estart | Lstart | Slack
Critical path(s) =

- 10 - Operation Priority
* Priority – need a mechanism to decide which ops to schedule first (when there are multiple choices)
* Common priority functions:
  - Height – distance from the exit node: gives priority by the amount of work left to do
  - Slackness – inversely proportional to slack: gives priority to ops on the critical path
  - Register use – priority to nodes with more source operands and fewer destination operands: reduces the number of live registers
  - Uncover – high priority to nodes with many children: frees up more nodes
  - Original order – when all else fails

- 11 - Height-Based Priority
* Height-based priority is the most common
  - priority(op) = MaxLstart - Lstart(op)
* (The example graph with per-op (Estart, Lstart) pairs and the resulting op/priority table did not survive extraction.)

- 12 - List Scheduling (aka Cycle Scheduler)
* Build the dependence graph, calculate priority
* Add all ops to the UNSCHEDULED set
* time = -1
* while (UNSCHEDULED is not empty)
  - time++
  - READY = UNSCHEDULED ops whose incoming dependences have all been satisfied
  - Sort READY using the priority function
  - For each op in READY (highest to lowest priority):
      Can op be scheduled at the current time (are the resources free)?
        Yes: schedule it (op.issue_time = time), mark the resources busy in
             RU_map relative to the issue time, and remove op from the
             UNSCHEDULED/READY sets
        No: continue
* (A runnable sketch follows.)
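A minimal runnable sketch of the cycle scheduler above, assuming a DAG, fully pipelined function units (each resource is busy only in its issue cycle), and a fixed issue width; the data layout and the three-op example are illustrative assumptions, not the course infrastructure. Any priority function can be plugged in, e.g. the height-based priority(op) = MaxLstart - Lstart(op) from slide 11.

    def cycle_schedule(ops, preds, latency, resource, priority, width=2):
        # ops: list of op ids; preds[o]: predecessor ids; latency[o]: cycles
        # until o's result is available; resource[o]: "ALU" or "MEM".
        unscheduled = set(ops)
        issue = {}       # op -> issue time
        ru_map = {}      # (time, resource) -> op (pipelined: busy one cycle)
        time = -1
        while unscheduled:
            time += 1
            ready = [o for o in unscheduled
                     if all(p in issue and issue[p] + latency[p] <= time
                            for p in preds[o])]
            for o in sorted(ready, key=priority, reverse=True):
                # Schedule only if the resource and an issue slot are free.
                if (time, resource[o]) not in ru_map and \
                   sum(t == time for t in issue.values()) < width:
                    ru_map[(time, resource[o])] = o
                    issue[o] = time
                    unscheduled.discard(o)
        return issue

    preds = {1: [], 2: [1], 3: [2]}
    lat = {1: 2, 2: 1, 3: 1}
    res = {1: "MEM", 2: "ALU", 3: "MEM"}
    print(cycle_schedule([1, 2, 3], preds, lat, res, priority=lambda o: -o))
    # -> {1: 0, 2: 2, 3: 3}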

- 13 - Cycle Scheduling Example
(Worked example: the dependence graph annotated with per-op (Estart, Lstart) pairs and latencies, the RU_map with ALU/MEM rows per cycle, and the cycle-by-cycle Schedule table of time/Ready/Placed and op/priority did not survive extraction.)

- 14 - List Scheduling (Operation Scheduler)
* Build the dependence graph, calculate priority
* Add all ops to the UNSCHEDULED set
* while (UNSCHEDULED is not empty)
  - op = operation in UNSCHEDULED with the highest priority
  - For time = estart to some deadline:
      Can op be scheduled at this time (are the resources free)?
        Yes: schedule it (op.issue_time = time), mark the resources busy in
             RU_map relative to the issue time, and remove op from UNSCHEDULED
        No: continue
  - Deadline reached without scheduling op?
      Yes: unplace all conflicting ops at op.estart and add them back to
           UNSCHEDULED; then schedule op at estart, mark the resources busy
           in RU_map, and remove op from UNSCHEDULED
* (A sketch follows.)
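A minimal sketch of the operation scheduler, reusing the data layout of the cycle-scheduler sketch above; the eviction step is a simplified reading of "unplace all conflicting ops" (it assumes height-style priorities so predecessors are placed first, and it does not guard against pathological evict/re-evict loops).

    def op_schedule(ops, preds, latency, resource, priority, estart, deadline=64):
        unscheduled = set(ops)
        issue, ru_map = {}, {}
        while unscheduled:
            op = max(unscheduled, key=priority)
            # Earliest legal time: op's estart, pushed later by placed preds.
            lo = max([estart[op]] +
                     [issue[p] + latency[p] for p in preds[op] if p in issue])
            placed = False
            for t in range(lo, deadline + 1):
                if (t, resource[op]) not in ru_map:
                    ru_map[(t, resource[op])] = op
                    issue[op] = t
                    unscheduled.discard(op)
                    placed = True
                    break
            if not placed:
                # Deadline reached: evict whatever holds op's estart slot,
                # return it to UNSCHEDULED, and place op there instead.
                slot = (estart[op], resource[op])
                victim = ru_map.pop(slot, None)
                if victim is not None:
                    del issue[victim]
                    unscheduled.add(victim)
                ru_map[slot] = op
                issue[op] = estart[op]
                unscheduled.discard(op)
        return issue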

- 15 - Homework Problem – Operation Scheduling
Machine: 2-issue, 1 memory port, 1 ALU. Memory port = 2 cycles, pipelined; ALU = 1 cycle.
1. Calculate height-based priorities
2. Schedule using the Operation scheduler
(The dependence graph annotated with per-op (Estart, Lstart) pairs and latencies, plus the empty RU_map and Schedule tables, did not survive extraction.)

- 16 - Generalize Beyond a Basic Block
* Superblock
  - Single entry
  - Multiple exits (side exits)
  - No side entries
* Schedule just like a BB, except:
  - The priority calculation needs to change
  - Control dependences must be dealt with

- 17 - Lstart in a Superblock
* Not a single Lstart any more
  - One Lstart per exit branch (Lstart is a vector!)
  - Exit branches have probabilities
* Example: Exit0 taken 25%, Exit1 taken 75% (the table of op / Estart / Lstart0 / Lstart1 did not survive extraction)

- 18 - Operation Priority in a Superblock
* Priority = dependence height and speculative yield
  - Height from op to each exit * probability of that exit
  - Summed across all exits in the superblock
* Priority(op) = SUM over exits i of Prob_i * (MAX_Lstart - Lstart_i(op) + 1), where the sum runs over the exits with valid late times for op
* Example: Exit0 (25%), Exit1 (75%); the table of op / Lstart0 / Lstart1 / Priority did not survive extraction. A worked instance follows.
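A worked instance of the formula with made-up numbers, since the slide's own table is lost; here MAX_Lstart = 7 is assumed, and lstarts[i] is None when op has no valid late time for exit i.

    probs = [0.25, 0.75]      # Exit0, Exit1
    MAX_LSTART = 7            # assumed overall latest start time

    def priority(lstarts):
        return sum(p * (MAX_LSTART - l + 1)
                   for p, l in zip(probs, lstarts) if l is not None)

    print(priority([2, 3]))     # 0.25*(7-2+1) + 0.75*(7-3+1) = 5.25
    print(priority([None, 0]))  # reaches only Exit1: 0.75*(7-0+1) = 6.0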

- 19 - Dependences in a Superblock
Superblock (control flow to the side exits is shown in red bold in the original figure):
  1: r1 = r2 + r3
  2: r4 = load(r1)
  3: p1 = cmpp(r3 == 0)
  4: branch p1 Exit1
  5: store (r4, -1)
  6: r2 = r2 - 4
  7: r5 = load(r2)
  8: p2 = cmpp(r5 > 9)
  9: branch p2 Exit2
* Data dependences shown; all are register flow except 1 -> 6, which is register anti
* Dependences define a precedence ordering of operations that ensures correct execution semantics
* What about control dependences? Control dependences define the precedence of ops with respect to branches

- 20 - Conservative Approach to Control Dependences
(Same superblock as the previous slide.)
* Make branches barriers: nothing moves above or below a branch
* This schedules each BB in the SB separately
* The result is sequential schedules – the whole purpose of the superblock is lost

- 21 - Upward Code Motion Across Branches
* Restriction 1a (register op)
  - The destination of op must not be in liveout(br)
  - Otherwise we may wrongly kill a live value
* Restriction 1b (memory op)
  - Op must not modify memory
  - Actually, live memory is what matters, but that is often too hard to determine
* Restriction 2
  - Op must not cause an exception that may terminate program execution when br is taken
  - (Op is executed more often than it is supposed to be, i.e., speculated)
  - A page fault or cache miss is OK
* Insert a control dep when either restriction is violated (a legality-check sketch follows)

Example:
  source:                control flow graph:
  if (x > 0)             1: branch x <= 0
    y = z / x            2: y = z / x
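A minimal sketch of this legality test; the op fields (dest, writes_mem, can_except) and the liveout set are illustrative assumptions about how a checker might see an op.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Op:
        dest: Optional[str]   # destination register, None for stores/branches
        writes_mem: bool
        can_except: bool      # may raise a program-terminating exception

    def can_move_above_branch(op, br_liveout):
        if op.dest is not None and op.dest in br_liveout:
            return False      # Restriction 1a: would kill a live value
        if op.writes_mem:
            return False      # Restriction 1b: never speculate a memory write
        if op.can_except:
            return False      # Restriction 2: exception when br is taken
        return True

    # The slide's divide: y = z / x may fault when x == 0, so it cannot be
    # speculated above "branch x <= 0" (absent hardware support such as
    # sentinel scheduling, the topic of the next class).
    print(can_move_above_branch(Op("y", False, True), {"x", "z"}))  # False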

- 22 - Downward Code Motion Across Branches
* Restriction 1 (liveness)
  - If no compensation code: same restriction as before – the destination of op must not be liveout
  - Else: no restrictions – duplicate the operation along both directions of the branch if the destination is liveout
* Restriction 2 (speculation)
  - Not applicable – downward motion is not speculation
* Again, insert a control dep when the restrictions are violated
* Part of the philosophy of superblocks is no compensation code insertion, hence Restriction 1 is enforced!

Example:
  source:                control flow graph:
  a = b * c              1: a = b * c
  if (x > 0)             2: branch x <= 0
  else ...

- 23 - Add Control Dependences to a Superblock
Superblock with assumed liveout sets {r1}, {r2}, {r5} at the exits:
  1: r1 = r2 + r3
  2: r4 = load(r1)
  3: p1 = cmpp(r2 == 0)
  4: branch p1 Exit1
  5: store (r4, -1)
  6: r2 = r2 - 4
  7: r5 = load(r2)
  8: p2 = cmpp(r5 > 9)
  9: branch p2 Exit2
Notes:
* All branches are control dependent on one another
* If no compensation code is allowed, all ops are dependent on the last branch – every op has a control dep to op 9!

- 24 - Class Problem
Draw the dependence graph (liveout sets at the exits: {r4}, {r1}, {r4, r8}, {r2}):
  1: r1 = r ...
  2: branch p1 Exit1
  3: store (r1, -1)
  4: branch p2 Exit2
  5: r2 = load(r7)
  6: r3 = r2 - 4
  7: branch p3 Exit3
  8: r4 = r3 / r8