Instruction Scheduling: Beyond Basic Blocks. Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved.


Instruction Scheduling: Beyond Basic Blocks
Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit permission to make copies of these materials for their personal use.

Local Scheduling
As long as we stay within a single block, list scheduling does well.
The problem is hard, so tie-breaking matters:
- More descendants in the dependence graph
- Prefer an operation with a last use over one with none
- Breadth first makes progress on all paths
  - Tends toward more ILP & fewer interlocks
- Depth first tries to complete uses of a value
  - Tends to use fewer registers
Classic work on this is Gibbons & Muchnick.
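To make the role of the priority and the tie-breaker concrete, here is a minimal sketch of forward list scheduling in Python. It is not the slides' own code: the function name `list_schedule`, the single-issue machine, and the five-operation example block with its latencies are all assumptions made for illustration. The priority is the latency-weighted path length to the end of the block, and ties are broken by the number of descendants in the dependence graph.

```python
from collections import defaultdict

def list_schedule(ops, deps, latency, units=1):
    """Forward list scheduling for one basic block (illustrative sketch).

    ops     : list of operation names
    deps    : list of (producer, consumer) edges in the dependence graph
    latency : dict op -> cycles before the op's result is available
    units   : operations that may issue per cycle (hypothetical machine)
    Returns a dict op -> issue cycle, with cycles numbered from 1.
    """
    succs, preds = defaultdict(list), defaultdict(list)
    for p, c in deps:
        succs[p].append(c)
        preds[c].append(p)

    # Priority: latency-weighted longest path from op to the end of the block.
    prio = {}
    def path_len(op):
        if op not in prio:
            prio[op] = latency[op] + max((path_len(s) for s in succs[op]), default=0)
        return prio[op]

    # Tie-breaker: number of descendants in the dependence graph.
    desc = {}
    def descendants(op):
        if op not in desc:
            seen = set()
            for s in succs[op]:
                seen.add(s)
                seen |= descendants(s)
            desc[op] = seen
        return desc[op]

    issue, cycle = {}, 1
    while len(issue) < len(ops):
        # An op is ready once every producer has issued and its latency has elapsed.
        ready = [op for op in ops if op not in issue
                 and all(p in issue and issue[p] + latency[p] <= cycle for p in preds[op])]
        ready.sort(key=lambda op: (-path_len(op), -len(descendants(op))))
        for op in ready[:units]:
            issue[op] = cycle
        cycle += 1
    return issue

# Hypothetical block: a and b are 3-cycle loads feeding c; c and d feed e.
ops = ["a", "b", "c", "d", "e"]
deps = [("a", "c"), ("b", "c"), ("c", "e"), ("d", "e")]
latency = {"a": 3, "b": 3, "c": 1, "d": 1, "e": 1}
schedule = list_schedule(ops, deps, latency)
print(schedule)
```

On this block the scheduler issues the two long-latency loads first, fills the load shadow with `d`, and stalls for one cycle before `c`. Swapping the sort key is all it takes to try a different tie-breaking rule.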

Local Scheduling
Forward and backward can produce different results.
[Figure: dependence graph for a block from the SPEC benchmark "go". Operations: cbr, cmp, store 1 through store 5, add 1 through add 4, addI, lshift, loadI 1 through loadI 4. Each node is labeled with its latency to the cbr; subscripts identify individual operations.]
[Table: latency of each operation type (load, loadI, add, addI, store, cmp).]

Local Scheduling
Using latency to root as the priority.

Forward Schedule (Int and Mem units)
Cycle  Operations issued
1      loadI 1, lshift
2      loadI 2, loadI 3
3      loadI 4, add 1
4      add 2, add 3
5      add 4, addI, store 1
6      cmp, store 2
7      store 3
8      store 4
9      store 5
       cbr

Backward Schedule (Int and Mem units)
Cycle  Operations issued
1      loadI 4
2      addI, lshift
3      add 4, loadI 3
4      add 3, loadI 2, store 5
5      add 2, loadI 1, store 4
6      add 1, store 3
7      store 2
8      store 1
       cmp
12     cbr

Local Scheduling
Schielke's RBF algorithm (Randomized Backward & Forward)
- Run 5 passes of forward list scheduling and 5 passes of backward list scheduling
- Break each tie randomly
- Keep the best schedule
  - Shortest time to completion
  - Other metrics are possible (shortest time + fewest registers)
In practice, this does very well.
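The RBF recipe above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not Schielke's implementation: the names `one_pass` and `rbf`, the single-issue machine, the fixed random seed, and the example block are all hypothetical. A backward pass is modeled by scheduling the dependence graph in reverse time and then flipping the result into forward cycles.

```python
import random
from collections import defaultdict

def one_pass(ops, deps, latency, units, rng, backward):
    """One list-scheduling pass with random tie-breaking; returns op -> issue cycle."""
    waits = defaultdict(list)    # x -> [(y, gap)]: x needs y placed at least gap steps earlier
    follow = defaultdict(list)   # y -> [(x, gap)]: inverse of waits
    for p, c in deps:
        first, later = (c, p) if backward else (p, c)
        waits[later].append((first, latency[p]))
        follow[first].append((later, latency[p]))

    prio = {}
    def dist(op):                # latency-weighted distance to the end of this pass
        if op not in prio:
            prio[op] = max((gap + dist(x) for x, gap in follow[op]), default=0)
        return prio[op]

    step, t = {}, 1
    while len(step) < len(ops):
        ready = [op for op in ops if op not in step
                 and all(y in step and step[y] + gap <= t for y, gap in waits[op])]
        ready.sort(key=lambda op: (-dist(op), rng.random()))  # random tie-break
        for op in ready[:units]:
            step[op] = t
        t += 1
    if backward:                 # convert reverse-time steps into forward cycles
        last = max(step.values())
        step = {op: last + 1 - s for op, s in step.items()}
    return step

def rbf(ops, deps, latency, units=1, passes=5, seed=0):
    """Run `passes` forward and `passes` backward passes; keep the shortest schedule."""
    rng = random.Random(seed)
    best = None
    for backward in (False, True):
        for _ in range(passes):
            sched = one_pass(ops, deps, latency, units, rng, backward)
            length = max(sched[o] + latency[o] for o in ops) - 1  # time to completion
            if best is None or length < best[0]:
                best = (length, sched)
    return best

# Hypothetical block: a and b are 3-cycle loads feeding c; c and d feed e.
ops = ["a", "b", "c", "d", "e"]
deps = [("a", "c"), ("b", "c"), ("c", "e"), ("d", "e")]
latency = {"a": 3, "b": 3, "c": 1, "d": 1, "e": 1}
length, sched = rbf(ops, deps, latency)
print(length, sched)
```

Only the comparison in `rbf` needs to change to rank schedules by another metric, such as completion time plus register pressure.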

Scheduling Larger Regions
Superlocal Scheduling
- Work an EBB at a time
- Example has four EBBs
- Only two have nontrivial paths
  - {B1, B2, B4} & {B1, B3}
- Having B1 in both causes conflicts
  - Moving an op out of B1 causes problems
  - Must insert "compensation" code in B3
  - Increases code space
[Figure: example CFG with blocks B1 = a, b, c, d; B2 = e, f; B3 = g; B4 = h, i; B5 = j, k; B6 = l. When c moves out of B1 into B2, there is no c on the path through B3, so a copy of c is inserted in B3 as compensation code. That copy was not added for speed.]
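As a rough illustration of how the four EBBs in the example arise, the sketch below enumerates EBB paths in a CFG. The slide text does not spell out the edges, so the edge list here is an assumption inferred from the EBBs and paths the slides name. The function name `ebb_paths` is also hypothetical. A block starts an EBB if it is the entry or does not have exactly one predecessor, and a path extends only through successors that have exactly one predecessor.

```python
def ebb_paths(succs, entry):
    """Enumerate the maximal paths through each extended basic block.

    succs : dict block -> list of successor blocks
    entry : name of the CFG's entry block
    """
    preds = {b: [] for b in succs}
    for b, ss in succs.items():
        for s in ss:
            preds[s].append(b)

    # EBB heads: the entry block, plus join points and blocks with no predecessor.
    heads = [b for b in succs if b == entry or len(preds[b]) != 1]

    paths = []
    def extend(path):
        # Only single-predecessor successors can stay inside the EBB.
        kids = [s for s in succs[path[-1]] if len(preds[s]) == 1 and s != entry]
        if not kids:
            paths.append(path)
        for k in kids:
            extend(path + [k])

    for h in heads:
        extend([h])
    return paths

# Assumed edges for the example CFG (not stated explicitly in the slides).
cfg = {
    "B1": ["B2", "B3"],
    "B2": ["B4", "B5"],
    "B3": ["B5"],
    "B4": ["B6"],
    "B5": ["B6"],
    "B6": [],
}
paths = ebb_paths(cfg, "B1")
print(paths)
```

Under these assumed edges, B1 appears at the head of both nontrivial paths, which is exactly why moving an operation into or out of B1 affects more than one path.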

Scheduling Larger Regions
Superlocal Scheduling
- Work an EBB at a time
- Example has four EBBs
- Only two have nontrivial paths
  - {B1, B2, B4} & {B1, B3}
- Having B1 in both causes conflicts
  - Moving an op into B1 causes problems
  - Lengthens {B1, B3}
  - Adds computation to {B1, B3}
  - May need compensation code, too
  - Renaming may avoid "undo f"
[Figure: example CFG after f moves from B2 into B1. B1 = a, b, c, d, f; B3 = undo f, g. The undo makes the path through B3 even longer.]

Scheduling Larger Regions
Superlocal Scheduling
How much can we get?
- Schielke saw 11 to 12% speedups
- Constrained away compensation code
Why was this harder than DVNT?
- DVNT moves information; scheduling moves ops
- DVNT moves forward; scheduling moves both ways
- Value tables partition nicely; the dependence graph does not
Value numbering is the best case for superlocal scope.

Scheduling Larger Regions
More Aggressive Superlocal Scheduling
- Clone blocks to create more context
Join points create blocks that must work in multiple contexts.
[Figure: example CFG; B5 is reached along 2 paths and B6 along 3 paths.]

Scheduling Larger Regions
More Aggressive Superlocal Scheduling
- Clone blocks to create more context
- Some blocks can combine
  - Single successor, single predecessor
[Figure: CFG after cloning. B5 is split into B5a and B5b (each j, k); B6 is split into B6a, B6b, and B6c (each l).]

Scheduling Larger Regions
More Aggressive Superlocal Scheduling
- Clone blocks to create more context
- Some blocks can combine
  - Single successor, single predecessor
- Now schedule EBBs {B1, B2, B4}, {B1, B2, B5a}, {B1, B3, B5b}
  - Pay heed to compensation code
- Works well for forward motion
- Backward motion still has off-path problems
  - Speeding up one path can slow down others (undo)
[Figure: CFG after combining. B4 = h, i, l; B5a = j, k, l; B5b = j, k, l.]
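The cloning step can be sketched as follows. This is a minimal illustration, assuming an acyclic CFG and the same inferred edge list for the example (the slides do not list the edges explicitly). The name `clone_joins` and the clone-naming scheme are hypothetical. Each join block is replaced by one clone per incoming edge, so every clone has a single predecessor. The follow-up step of combining single-successor, single-predecessor pairs is not shown.

```python
def clone_joins(succs, order):
    """Replace every join block with one clone per predecessor.

    succs : dict block -> list of successors (acyclic CFG assumed)
    order : the original blocks in topological order, so that a join is
            processed only after all of its predecessors have been cloned
    Returns a new successor map in which no block has more than one predecessor.
    """
    succs = {b: list(s) for b, s in succs.items()}
    for b in order:
        preds = [p for p in succs for s in succs[p] if s == b]
        if len(preds) < 2:
            continue
        for i, p in enumerate(preds):
            clone = f"{b}{chr(ord('a') + i)}"        # B5 -> B5a, B5b, ...
            succs[clone] = list(succs[b])            # clone keeps b's successors
            succs[p] = [clone if s == b else s for s in succs[p]]
        del succs[b]
    return succs

# Assumed edges for the example CFG (not stated explicitly in the slides).
cfg = {
    "B1": ["B2", "B3"],
    "B2": ["B4", "B5"],
    "B3": ["B5"],
    "B4": ["B6"],
    "B5": ["B6"],
    "B6": [],
}
cloned = clone_joins(cfg, ["B1", "B2", "B3", "B4", "B5", "B6"])
print(cloned)
```

Cloning every join can grow the code substantially, since each clone's successors may need cloning in turn, so a practical scheduler would likely bound how much duplication it allows.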

Scheduling Larger Regions
Trace Scheduling
- Start with execution counts for edges
  - Obtained by profiling
- Pick the "hot" path
  - B1, B2, B4, B6
- Schedule it
  - Compensation code in B3, B5 if needed
  - Get the hot path right!
- If we picked the right path, the other blocks do not matter as much
  - Places a premium on quality profiles
[Figure: example CFG annotated with edge execution counts. Block counts could mislead us: see B5.]

Scheduling Larger Regions
Trace Scheduling
Entire CFG:
- Pick & schedule hot path
- Insert compensation code
- Remove hot path from CFG
- Repeat the process until CFG is empty
Idea:
- Hot paths matter
- Farther off hot path, less it matters
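The pick, remove, and repeat loop can be sketched as a trace-selection routine. This sketch covers only trace selection, not scheduling or compensation code. The name `pick_traces` and the greedy growth rule (seed with the hottest remaining edge, then extend along the hottest edges in both directions) are assumptions, and the edge counts are invented for illustration because the profile numbers from the original figure are not available here.

```python
def pick_traces(blocks, edge_counts):
    """Repeatedly peel the hottest path off the CFG until no blocks remain.

    blocks      : list of block names
    edge_counts : dict (src, dst) -> execution count from profiling
    Returns a list of traces, hottest first; each trace is a list of blocks.
    """
    remaining = set(blocks)
    traces = []
    while remaining:
        # Edges whose endpoints are both still in the CFG (self-loops ignored).
        live = {e: n for e, n in edge_counts.items()
                if e[0] in remaining and e[1] in remaining and e[0] != e[1]}
        if not live:
            traces.extend([b] for b in blocks if b in remaining)
            break
        src, dst = max(live, key=live.get)           # seed: hottest remaining edge
        trace, used = [src, dst], {src, dst}
        while True:                                  # grow forward along hot edges
            out = {e: n for e, n in live.items() if e[0] == trace[-1] and e[1] not in used}
            if not out:
                break
            nxt = max(out, key=out.get)[1]
            trace.append(nxt)
            used.add(nxt)
        while True:                                  # grow backward along hot edges
            inc = {e: n for e, n in live.items() if e[1] == trace[0] and e[0] not in used}
            if not inc:
                break
            prv = max(inc, key=inc.get)[0]
            trace.insert(0, prv)
            used.add(prv)
        traces.append(trace)
        remaining -= used                            # remove the trace from the CFG
    return traces

# Hypothetical edge counts, invented for illustration only.
counts = {
    ("B1", "B2"): 7, ("B1", "B3"): 3,
    ("B2", "B4"): 5, ("B2", "B5"): 2,
    ("B3", "B5"): 3, ("B4", "B6"): 5,
    ("B5", "B6"): 5,
}
traces = pick_traces(["B1", "B2", "B3", "B4", "B5", "B6"], counts)
print(traces)
```

With these made-up counts, B4 and B5 each execute five times, yet only B4 lies on the hottest path out of B2. Block counts alone would not separate the two, which is why the selection works from edge counts.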