Instruction Scheduling Hal Perkins Winter 2008

CSE P 501 – Compilers: Instruction Scheduling
Hal Perkins, Winter 2008
© 2002-08 Hal Perkins & UW CSE

Agenda
- Instruction scheduling issues – latencies
- List scheduling

Issues (1)
- Many operations have non-zero latencies
- Modern machines can issue several operations per cycle
- Want to take advantage of multiple function units on chip
- Loads & stores may or may not block; if they don't block, there may be slots after a load/store that can be filled with other useful work

Issues (2)
- Branch costs vary
- Branches on some processors have delay slots
- Modern processors have heuristics to predict whether branches are taken and try to keep pipelines full

GOAL: The scheduler should reorder instructions to hide latencies, take advantage of multiple function units and delay slots, and help the processor effectively pipeline execution

Some Idealized Latencies

Operation  Cycles
LOAD       3
STORE      3
ADD        1
MULT       2
SHIFT      1
BRANCH     0 to 8

Example: w = w*2*x*y*z;

Simple schedule:
 1 LOAD  r1 <- w
 4 ADD   r1 <- r1,r1
 5 LOAD  r2 <- x
 8 MULT  r1 <- r1,r2
 9 LOAD  r2 <- y
12 MULT  r1 <- r1,r2
13 LOAD  r2 <- z
16 MULT  r1 <- r1,r2
18 STORE w <- r1
21 r1 is free
2 registers, 20 cycles

Loads early:
 1 LOAD  r1 <- w
 2 LOAD  r2 <- x
 3 LOAD  r3 <- y
 4 ADD   r1 <- r1,r1
 5 MULT  r1 <- r1,r2
 6 LOAD  r2 <- z
 7 MULT  r1 <- r1,r3
 9 MULT  r1 <- r1,r2
11 STORE w <- r1
14 r1 is free
3 registers, 13 cycles

Instruction Scheduling Problem

Given a code fragment for some machine and latencies for each operation, reorder to minimize execution time.

Constraints:
- Produce correct code
- Minimize wasted cycles
- Avoid spilling registers
- Do this efficiently

Precedence Graph
- Nodes n are operations
- Attributes of each node:
    type – kind of operation
    delay – latency
- If node n2 uses the result of node n1, there is an edge e = (n1,n2) in the graph

Example Graph

Code:
a LOAD  r1 <- w
b ADD   r1 <- r1,r1
c LOAD  r2 <- x
d MULT  r1 <- r1,r2
e LOAD  r2 <- y
f MULT  r1 <- r1,r2
g LOAD  r2 <- z
h MULT  r1 <- r1,r2
i STORE w <- r1
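The precedence edges for this block can be recovered mechanically. Below is a minimal sketch (Python; my own rendering, not from the slides) that builds the graph by tracking, for each register or memory name, the node that most recently defined it. The delay column uses the idealized latencies from the earlier slide.

```python
# (name, defs, uses, delay) for the example block
instrs = [
    ("a", ["r1"], ["w"],        3),   # LOAD  r1 <- w
    ("b", ["r1"], ["r1"],       1),   # ADD   r1 <- r1,r1
    ("c", ["r2"], ["x"],        3),   # LOAD  r2 <- x
    ("d", ["r1"], ["r1", "r2"], 2),   # MULT  r1 <- r1,r2
    ("e", ["r2"], ["y"],        3),   # LOAD  r2 <- y
    ("f", ["r1"], ["r1", "r2"], 2),   # MULT  r1 <- r1,r2
    ("g", ["r2"], ["z"],        3),   # LOAD  r2 <- z
    ("h", ["r1"], ["r1", "r2"], 2),   # MULT  r1 <- r1,r2
    ("i", ["w"],  ["r1"],       3),   # STORE w <- r1
]

def build_precedence_graph(instrs):
    """Edge (n1, n2) means n2 uses the result of n1."""
    edges = set()
    last_def = {}              # name -> node that last defined it
    for name, defs, uses, _ in instrs:
        for u in uses:
            if u in last_def:
                edges.add((last_def[u], name))
        for d in defs:
            last_def[d] = name
    return edges

edges = build_precedence_graph(instrs)
print(sorted(edges))
# [('a','b'), ('b','d'), ('c','d'), ('d','f'),
#  ('e','f'), ('f','h'), ('g','h'), ('h','i')]
```

Note that this captures only the true (flow) dependences; the reuse of r2 also creates anti-dependences (d must read r2 before e overwrites it), which is why the algorithm described later renames registers.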

Schedules (1)

A correct schedule S maps each node n to a non-negative integer representing its cycle number, such that:
- S(n) >= 0 for all nodes n (obvious)
- If (n1,n2) is an edge, then S(n1) + delay(n1) <= S(n2)
- For each type t, there are no more operations of type t in any cycle than the target machine can issue

Schedules (2)
- The length of a schedule S, denoted L(S), is L(S) = max over n of ( S(n) + delay(n) )
- The goal is to find the shortest possible correct schedule
- Other possible goals: minimize use of registers, power, space, …
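These definitions translate directly into code. Here is a small sketch (my own helpers, not from the slides) that checks the correctness conditions and computes L(S), assuming a single-issue machine so the per-type issue limit collapses to "at most one op per cycle". It is applied to the "loads early" schedule from the example.

```python
from collections import Counter

def is_correct(S, edges, delay, issue_width=1):
    if any(c < 1 for c in S.values()):      # cycles numbered from 1
        return False
    for n1, n2 in edges:                    # S(n1) + delay(n1) <= S(n2)
        if S[n1] + delay[n1] > S[n2]:
            return False
    counts = Counter(S.values())            # issue-width constraint
    return all(k <= issue_width for k in counts.values())

def length(S, delay):
    """L(S) = max over n of (S(n) + delay(n))."""
    return max(S[n] + delay[n] for n in S)

# The "loads early" schedule from the example slide
delay = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3,
         "f": 2, "g": 3, "h": 2, "i": 3}
edges = [("a","b"), ("b","d"), ("c","d"), ("d","f"),
         ("e","f"), ("f","h"), ("g","h"), ("h","i")]
S = {"a": 1, "c": 2, "e": 3, "b": 4, "d": 5,
     "g": 6, "f": 7, "h": 9, "i": 11}

print(is_correct(S, edges, delay))   # True
print(length(S, delay))              # 14 -- r1 is free at cycle 14
```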

Constraints

Main points:
- All operands must be available
- Multiple operations can be ready at any given point
- Moving operations can lengthen register lifetimes
- Moving uses near definitions can shorten register lifetimes
- Operations can have multiple predecessors
- Collectively this makes scheduling NP-complete

Local scheduling is the simpler case:
- Straight-line code
- Consistent, predictable latencies

Algorithm Overview
- Build a precedence graph P
- Compute a priority function over the nodes in P (typical: longest latency-weighted path)
- Use list scheduling to construct a schedule, one cycle at a time:
    Use a queue of operations that are ready
    At each cycle, choose a ready operation and schedule it, then update the ready queue
- Rename registers to avoid false dependencies and conflicts

List Scheduling Algorithm

Cycle = 1; Ready = leaves of P; Active = empty;
while (Ready and/or Active are not empty)
    if (Ready is not empty)
        remove an op from Ready;
        S(op) = Cycle;
        Active = Active ∪ {op};
    Cycle++;
    for each op in Active
        if (S(op) + delay(op) <= Cycle)
            remove op from Active;
            for each successor s of op in P
                if (s is ready – i.e., all operands available)
                    add s to Ready
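The pseudocode above can be rendered as a runnable sketch (Python; my own rendering, not from the slides). The slide leaves open which ready op to remove first; this version uses the typical longest latency-weighted path priority and assumes a single-issue machine.

```python
def list_schedule(nodes, edges, delay):
    succs = {n: [] for n in nodes}
    preds = {n: [] for n in nodes}
    for n1, n2 in edges:
        succs[n1].append(n2)
        preds[n2].append(n1)

    # priority(n) = longest latency-weighted path from n to any root
    prio = {}
    def path(n):
        if n not in prio:
            prio[n] = delay[n] + max((path(s) for s in succs[n]), default=0)
        return prio[n]
    for n in nodes:
        path(n)

    S = {}                                        # S(op) = issue cycle
    ready = [n for n in nodes if not preds[n]]    # leaves of P
    active = []
    cycle = 1
    while ready or active:
        if ready:
            op = max(ready, key=prio.get)         # highest priority first
            ready.remove(op)
            S[op] = cycle
            active.append(op)
        cycle += 1
        for op in [a for a in active if S[a] + delay[a] <= cycle]:
            active.remove(op)
        # an op becomes ready once every predecessor has completed
        for n in nodes:
            if n not in S and n not in ready and all(
                    p in S and S[p] + delay[p] <= cycle for p in preds[n]):
                ready.append(n)
    return S

# The example block with the idealized latencies
nodes = list("abcdefghi")
delay = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3,
         "f": 2, "g": 3, "h": 2, "i": 3}
edges = [("a","b"), ("b","d"), ("c","d"), ("d","f"),
         ("e","f"), ("f","h"), ("g","h"), ("h","i")]

print(list_schedule(nodes, edges, delay))
# reproduces the 13-cycle "loads early" schedule from the example slide
# (registers still to be renamed): a:1 c:2 e:3 b:4 d:5 g:6 f:7 h:9 i:11
```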

Example

Code:
a LOAD  r1 <- w
b ADD   r1 <- r1,r1
c LOAD  r2 <- x
d MULT  r1 <- r1,r2
e LOAD  r2 <- y
f MULT  r1 <- r1,r2
g LOAD  r2 <- z
h MULT  r1 <- r1,r2
i STORE w <- r1

Variations
- Backward list scheduling: work from the root to the leaves, scheduling instructions from the end to the beginning of the block
- In practice, try both forward and backward and pick the result that minimizes costs; this costs little extra, since the precedence graph and other information can be reused
- Global scheduling and loop scheduling extend the basic idea in more aggressive compilers