CSE P 501 – Compilers: Instruction Scheduling
Hal Perkins, Autumn 2011
© 2002-11 Hal Perkins & UW CSE

Agenda
- Instruction scheduling issues: latencies
- List scheduling

Issues (1)
- Many operations have non-zero latencies
- Modern machines can issue several operations per cycle
- We want to take advantage of multiple function units on the chip
- Loads and stores may or may not block; there may be slots after a load/store that can be filled with other useful work

Issues (2)
- Branch costs vary
- Branches on some processors have delay slots
- Modern processors have heuristics to predict whether branches are taken, and try to keep their pipelines full

GOAL: The scheduler should reorder instructions to hide latencies, take advantage of multiple function units and delay slots, and help the processor pipeline execution effectively.

Latencies for a Simple Example Machine

    Operation   Cycles
    LOAD        3
    STORE       3
    ADD         1
    MULT        2
    SHIFT       1
    BRANCH      0 to 8
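
For use in the sketches later in these notes, the latency table can be encoded as a small Python dict (my own encoding; the names are not from the slides):

    # Latencies (in cycles) for the example machine.
    LATENCY = {"LOAD": 3, "STORE": 3, "ADD": 1, "MULT": 2, "SHIFT": 1}
    # BRANCH is omitted because its cost varies (0 to 8 cycles).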

Example: w = w*2*x*y*z;

Simple schedule:
     1  LOAD  r1 <- w
     4  ADD   r1 <- r1,r1
     5  LOAD  r2 <- x
     8  MULT  r1 <- r1,r2
     9  LOAD  r2 <- y
    12  MULT  r1 <- r1,r2
    13  LOAD  r2 <- z
    16  MULT  r1 <- r1,r2
    18  STORE w <- r1
    21  (r1 is free)
2 registers, 20 cycles

Loads early:
     1  LOAD  r1 <- w
     2  LOAD  r2 <- x
     3  LOAD  r3 <- y
     4  ADD   r1 <- r1,r1
     5  MULT  r1 <- r1,r2
     6  LOAD  r2 <- z
     7  MULT  r1 <- r1,r3
     9  MULT  r1 <- r1,r2
    11  STORE w <- r1
    14  (r1 is free)
3 registers, 13 cycles

Instruction Scheduling Problem
Given a code fragment for some machine and latencies for each operation, reorder the operations to minimize execution time.

Constraints:
- Produce correct code
- Minimize wasted cycles
- Avoid spilling registers
- Do this efficiently

Precedence Graph
- Nodes n are operations
- Attributes of each node:
    type  - kind of operation
    delay - latency
- If node n2 uses the result of node n1, there is an edge e = (n1,n2) in the graph

Example
Code (the slide also shows the corresponding precedence graph):
    a  LOAD  r1 <- w
    b  ADD   r1 <- r1,r1
    c  LOAD  r2 <- x
    d  MULT  r1 <- r1,r2
    e  LOAD  r2 <- y
    f  MULT  r1 <- r1,r2
    g  LOAD  r2 <- z
    h  MULT  r1 <- r1,r2
    i  STORE w <- r1
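
In the dict-based encoding started above, the block and its flow (true-dependence) edges might look like this; the edge list is my reading of the dependences, matching the graph on the slide:

    # Each op: name -> opcode; an edge (n1, n2) means n2 uses n1's result.
    OPS = {"a": "LOAD", "b": "ADD", "c": "LOAD", "d": "MULT", "e": "LOAD",
           "f": "MULT", "g": "LOAD", "h": "MULT", "i": "STORE"}
    EDGES = [("a", "b"), ("b", "d"), ("c", "d"), ("d", "f"), ("e", "f"),
             ("f", "h"), ("g", "h"), ("h", "i")]
    # Anti-dependences from the reuse of r2 are ignored here; they vanish
    # once values are renamed, which the scheduling discussion assumes.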

Schedules (1)
A correct schedule S maps each node n to a non-negative integer representing its cycle number, such that:
1. S(n) >= 0 for all nodes n (obvious)
2. If (n1,n2) is an edge, then S(n1) + delay(n1) <= S(n2)
3. For each type t, there are no more operations of type t in any cycle than the target machine can issue

Schedules (2)
The length of a schedule S, denoted L(S), is
    L(S) = max_n (S(n) + delay(n))
The goal is to find the shortest possible correct schedule.
Other possible goals: minimize use of registers, power, space, ...
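
The two slides above translate directly into a small checker; this is a sketch using my own parameter names (S, delay, and optype are dicts keyed by node; width is the assumed per-type issue limit):

    from collections import Counter

    def is_correct(S, edges, delay, optype, width=1):
        if any(S[n] < 0 for n in S):                        # S(n) >= 0
            return False
        if any(S[n1] + delay[n1] > S[n2] for n1, n2 in edges):
            return False                                    # operand not ready in time
        slots = Counter((S[n], optype[n]) for n in S)       # ops per (cycle, type)
        return all(used <= width for used in slots.values())

    def schedule_length(S, delay):
        return max(S[n] + delay[n] for n in S)              # L(S)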

Constraints
Main points:
- All operands must be available
- Multiple operations can be ready at any given point
- Moving operations can lengthen register lifetimes
- Moving uses near definitions can shorten register lifetimes
- Operations can have multiple predecessors
Collectively, this makes scheduling NP-complete.

Local scheduling is the simpler case:
- Straight-line code
- Consistent, predictable latencies

Algorithm Overview
1. Build a precedence graph P
2. Compute a priority function over the nodes in P (typical: longest latency-weighted path)
3. Use list scheduling to construct a schedule, one cycle at a time:
   - Use a queue of operations that are ready
   - At each cycle, choose a ready operation and schedule it, then update the ready queue
4. Rename registers to avoid false dependencies and conflicts
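
The typical priority function, the longest latency-weighted path from each node to the end of the block, can be sketched as follows (assuming the graph is a DAG and reusing the dict-based encoding from earlier):

    def priorities(nodes, edges, delay):
        """prio[n] = delay[n] + the longest latency-weighted path below n."""
        succs = {n: [] for n in nodes}
        for n1, n2 in edges:
            succs[n1].append(n2)
        prio = {}
        def walk(n):
            if n not in prio:
                prio[n] = delay[n] + max((walk(s) for s in succs[n]), default=0)
            return prio[n]
        for n in nodes:
            walk(n)
        return prio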

List Scheduling Algorithm

    Cycle = 1; Ready = leaves of P; Active = empty;
    while (Ready and/or Active are not empty)
        if (Ready is not empty)
            remove an op from Ready;
            S(op) = Cycle;
            Active = Active ∪ {op};
        Cycle++;
        for each op in Active
            if (S(op) + delay(op) <= Cycle)
                remove op from Active;
                for each successor s of op in P
                    if (s is ready, i.e., all operands available)
                        add s to Ready
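
A direct transliteration into Python (a sketch: single issue per cycle, as on the slide, with ties in Ready broken by the priority function from the previous sketch; readiness is tracked by counting unretired predecessors, which matches the "all operands available" test):

    def list_schedule(nodes, edges, delay, prio):
        succs = {n: [] for n in nodes}
        npreds = {n: 0 for n in nodes}
        for n1, n2 in edges:
            succs[n1].append(n2)
            npreds[n2] += 1
        ready = {n for n in nodes if npreds[n] == 0}    # the "leaves" of P
        active, S, cycle = [], {}, 1
        while ready or active:
            if ready:
                op = max(ready, key=prio.get)           # highest-priority ready op
                ready.remove(op)
                S[op] = cycle
                active.append(op)
            cycle += 1
            for op in [a for a in active if S[a] + delay[a] <= cycle]:
                active.remove(op)                       # op's result is now available
                for s in succs[op]:
                    npreds[s] -= 1
                    if npreds[s] == 0:                  # all operands available
                        ready.add(s)
        return S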

Example
Code (repeated from above, to trace the algorithm against):
    a  LOAD  r1 <- w
    b  ADD   r1 <- r1,r1
    c  LOAD  r2 <- x
    d  MULT  r1 <- r1,r2
    e  LOAD  r2 <- y
    f  MULT  r1 <- r1,r2
    g  LOAD  r2 <- z
    h  MULT  r1 <- r1,r2
    i  STORE w <- r1
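
Running the earlier sketches on this block (this snippet assumes the hypothetical LATENCY, OPS, EDGES, priorities, and list_schedule definitions from above) reproduces the 13-cycle "loads early" schedule from the earlier example:

    delay = {n: LATENCY[op] for n, op in OPS.items()}
    prio = priorities(list(OPS), EDGES, delay)
    S = list_schedule(list(OPS), EDGES, delay, prio)
    print(sorted(S.items(), key=lambda kv: kv[1]))
    # a:1  c:2  e:3  b:4  d:5  g:6  f:7  h:9  i:11 -- the same issue
    # cycles as the "loads early" schedule (modulo register renaming).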

Forward vs. Backward
Backward list scheduling:
- Works from the root to the leaves
- Schedules instructions from the end to the beginning of the block
In practice, compilers try both and pick the result that minimizes cost:
- Little extra expense, since the precedence graph and other information can be reused
- Different directions win in different cases
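
Ignoring issue-width limits, the heart of a backward pass can be sketched as an as-late-as-possible (ALAP) computation; this framing is mine, not the slides': work from the root toward the leaves, giving each op the latest issue cycle its uses allow.

    def alap_cycles(nodes, edges, delay, length):
        """Latest issue cycle for each op such that every result is
        available by cycle length + 1; resource limits are ignored."""
        succs = {n: [] for n in nodes}
        for n1, n2 in edges:
            succs[n1].append(n2)
        S = {}
        def visit(n):
            if n not in S:
                S[n] = min((visit(s) for s in succs[n]), default=length + 1) - delay[n]
            return S[n]
        for n in nodes:
            visit(n)
        return S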

Beyond Basic Blocks
List scheduling dominates, but moving beyond basic blocks can improve the quality of the code. Some possibilities:
- Schedule extended basic blocks: watch for exit points, which limit reordering or require compensating code
- Trace scheduling: use profiling information to select regions for scheduling, using traces (paths) through the code