Combining Scheduling & Allocation Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice.

Presentation transcript:

Combining Scheduling & Allocation Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved. COMP 412 FALL 2010 The Last Lecture

Combining Scheduling & Allocation

Sometimes, combining two optimizations can produce solutions that cannot be obtained by solving them independently.
- Requires bilateral interactions between optimizations
  - Click and Cooper, “Combining Analyses, Combining Optimizations”, TOPLAS 17(2), March
- Combining two optimizations can be a challenge (SCCP)

Scheduling & allocation are a classic example
- Scheduling changes variable lifetimes
- Renaming in the allocator changes dependences (false dependences)
- Spilling changes the underlying code

Combining Scheduling & Allocation

Many authors have tried to combine allocation & scheduling
- Underallocate to leave “room” for the scheduler
  - Can result in underutilization of registers
- Preallocate to use all registers
  - Can create false dependences
- Solving the problems together can produce solutions that cannot be obtained by solving them independently
  - See Click and Cooper, “Combining Analyses, Combining Optimizations”, TOPLAS 17(2), March

In general, these papers try to combine global allocators with local or regional schedulers: an algorithmic mismatch

Before we go there, a long digression about how much improvement we might expect …

Iterative Repair Scheduling

The Problem
- List scheduling has dominated field for 20 years
- Anecdotal evidence both good & bad, little solid evidence
- No intuitive paradigm for how it works
- It works well, but will it work well in the future?
- Is there room for improvement? (e.g., with allocation?)

Schielke’s Idea
- Try more powerful algorithms from other domains
- Look for better schedules
- Look for understanding of the solution space

This led us to iterative repair scheduling

Iterative Repair Scheduling

The Algorithm
- Start from some approximation to a schedule (bad or broken)
- Find & prioritize all cycles that need repair (tried 6 schemes)
  - Either resource or data constraints
- Perform the needed repairs, in priority order
  - Break ties randomly
  - Reschedule dependent operations, in random order
  - Evaluation function on repair can reject the repair (try another)
- Iterate until repair list is empty
- Repeat this process many times to explore the solution space
  - Keep the best result!

Randomization & restart is a fundamental theme of our recent work

Iterative repair works well on many kinds of scheduling problems.
- Scheduling cargo for the space shuttle
- Typical problems in the literature involve 10s or 100s of repairs
- We used it with millions of repairs
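The repair loop can be sketched in a few lines of Python. This is a deliberately reduced illustration, not Schielke's implementation: it assumes a machine that issues `width` ops per cycle, repairs only resource conflicts (data constraints are restored by pushing dependents later in topological order rather than random order), always repairs the earliest broken cycle first, and has no evaluation function that can reject a repair. The names `repair_pass`, `iterative_repair`, `preds`, and `latency` are hypothetical.

```python
import random

def repair_pass(ops, preds, latency, width, rng):
    """One run of a simplified iterative repair; `ops` must be in topological order."""
    # Broken starting point: ASAP issue times that ignore the issue-width limit.
    S = {}
    for op in ops:
        S[op] = max((S[p] + latency[p] for p in preds.get(op, ())), default=1)
    while True:
        by_cycle = {}
        for op in ops:
            by_cycle.setdefault(S[op], []).append(op)
        broken = sorted(c for c, group in by_cycle.items() if len(group) > width)
        if not broken:
            return S                           # repair list is empty
        cycle = broken[0]                      # assumed priority: earliest cycle first
        victim = rng.choice(by_cycle[cycle])   # break ties randomly
        S[victim] = cycle + 1
        for op in ops:                         # push dependents so data constraints hold
            need = max((S[p] + latency[p] for p in preds.get(op, ())), default=1)
            S[op] = max(S[op], need)

def iterative_repair(ops, preds, latency, width=1, restarts=50, seed=0):
    """Restart the randomized repair many times and keep the shortest schedule."""
    rng = random.Random(seed)
    best, best_len = None, None
    for _ in range(restarts):
        S = repair_pass(ops, preds, latency, width, rng)
        length = max(S[op] + latency[op] - 1 for op in ops)
        if best_len is None or length < best_len:
            best, best_len = S, length
    return best, best_len
```

Each restart can land on a different schedule because the choice of which op to move is random; keeping the best of many runs is what lets the search explore the solution space.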

Iterative Repair Scheduling

How does iterative repair do versus list scheduling?
- Found many schedules that used fewer registers (hopeful sign for this lecture)
- Found very few faster schedules
- Were disappointed with the results
- Began a study of the properties of scheduling problems

Iterative repair, itself, doesn’t justify the additional costs
- Can we identify schedulers where it will win?
- Can we learn about the properties of scheduling problems?
  - And about the behavior of list scheduling...

Instruction Scheduling Study

Methodology
- Looked at blocks & extended blocks in benchmark programs
- Used Schielke’s RBF algorithm & tested for optimality
- If non-optimal, used iterative repair (IR) to find its best schedule (simple tests)
- Checked these results against an IP formulation using CPLEX

The Results
- List scheduling does quite well on a conventional uniprocessor
- Over 92% of blocks scheduled optimally for speed
- Over 73% of extended blocks scheduled optimally for speed
- CPLEX had a hard time with the easy blocks
  - Too many optimal solutions to investigate

These results were obtained with code from benchmark programs. Recall, from the local scheduling lecture, that RBF generated optimal schedules for 80% of the randomly generated blocks.

Holes in schedule? Delays on critical path?

Combining Allocation & Scheduling (back to today’s subject)

The Problem
- Well-understood that the problems are intricately related
- Previous work under-allocates or under-schedules
  - Except Goodman & Hsu

Our Approach
- Formulate an iterative repair framework
  - Moves for scheduling, as before
  - Moves to decrease register pressure or to spill
- Allows fair competition in a combined attack
- Grows out of search for novel techniques from other areas

Combining Allocation & Scheduling

The Details
- Run IR scheduler & keep the schedule with lowest demand for registers (register pressure)
- Start with ALAP schedule rather than ASAP schedule
- Reject any repair that increases maximum pressure
- Cycle with pressure > k triggers “pressure repair”
  - Identify ops that reduce pressure & move one
  - Lower threshold for k seems to help
- Ran it against the classic method
  - Schedule, allocate, schedule (using Briggs’ allocator)
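The "reject any repair that increases maximum pressure" test needs a way to measure pressure for a candidate schedule. A minimal sketch follows, under an assumed liveness model: a value occupies a register from the cycle after its defining op issues through the cycle of its last use. The function names and the `operands` map are hypothetical, and values live on entry or exit are ignored.

```python
def max_pressure(schedule, operands):
    """schedule: op -> issue cycle; operands: op -> list of ops whose results it reads."""
    last_use = {}
    for op, srcs in operands.items():
        for s in srcs:
            last_use[s] = max(last_use.get(s, 0), schedule[op])
    pressure = {}
    for value, end in last_use.items():
        # Count the value as live in every cycle of its (assumed) live range.
        for cycle in range(schedule[value] + 1, end + 1):
            pressure[cycle] = pressure.get(cycle, 0) + 1
    return max(pressure.values(), default=0)

def accept_repair(before, after, operands):
    """Accept a repair only if it does not raise the maximum register pressure."""
    return max_pressure(after, operands) <= max_pressure(before, operands)

# w = (a + b) + (x + y): issuing c early shortens the live ranges of a and b.
operands = {"c": ["a", "b"], "z": ["x", "y"], "w": ["c", "z"]}
loads_first = {"a": 1, "b": 2, "x": 3, "y": 4, "c": 5, "z": 6, "w": 7}
interleaved = {"a": 1, "b": 2, "c": 3, "x": 4, "y": 5, "z": 6, "w": 7}
print(max_pressure(loads_first, operands), max_pressure(interleaved, operands))  # 4 3
```

Both schedules have the same length, so a scheduler that looks only at speed cannot tell them apart; the pressure measure is what separates them.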

Combining Allocation & Scheduling

The Results
- Many opportunities to lower pressure
  - 12% of basic blocks
  - 33% of extended blocks
- These schedules may be faster, too
  - Best case was 41.3% (procedure)
  - Average case, 16 regs, was 5.4%
  - Average case, 32 regs, was 3.5% (whole applications)
- This approach finds faster codes that spill fewer values
- It is competing against a very good global allocator
  - Rematerialization catches many of the same effects

Knowing that new solutions exist does not ensure that they are better solutions! This work confirms years of suspicion, while providing an effective, albeit nontraditional, technique.

The opportunity is present, but the IR scheduler is still quite slow …

Balancing Speed and Register Pressure (other approaches in the literature)

Goodman & Hsu proposed a novel scheme
- Context: debate about prepass versus postpass scheduling
- Problem: tradeoff between allocation & scheduling
- Solution:
  - Schedule for speed until fewer than Threshold registers
  - Schedule for registers until more than Threshold registers
- Details:
  - “for speed” means one of the latency-weighted priorities
  - “for registers” means an incremental adaptation of SU scheme

James R. Goodman and Wei-Chung Hsu, “Code Scheduling and Register Allocation in Large Basic Blocks,” Proceedings of the 2nd International Conference on Supercomputing, St. Malo, France, 1988.
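The switching rule can be illustrated with a small Python sketch. It is not Goodman & Hsu's code: `next_mode` and `pick_next` are hypothetical names, and the "for registers" choice uses a simple stand-in (the net change in live registers if the op issues) rather than their incremental Sethi-Ullman adaptation.

```python
def next_mode(mode, free_regs, threshold):
    """Switch modes only when the free-register count crosses the threshold."""
    if mode == "speed" and free_regs < threshold:
        return "registers"
    if mode == "registers" and free_regs > threshold:
        return "speed"
    return mode

def pick_next(ready, mode, speed_priority, reg_delta):
    """Choose the next op to issue from the ready list.

    speed_priority[op]: latency-weighted priority, higher is more urgent.
    reg_delta[op]: assumed net change in live registers if op issues now.
    """
    if mode == "speed":
        return max(ready, key=lambda op: speed_priority[op])
    # Stand-in for the register-oriented priority: free the most registers,
    # falling back on the speed priority to break ties.
    return min(ready, key=lambda op: (reg_delta[op], -speed_priority[op]))

ready = ["load", "add"]
speed_priority = {"load": 5, "add": 1}
reg_delta = {"load": +1, "add": -1}
print(pick_next(ready, "speed", speed_priority, reg_delta))      # load
print(pick_next(ready, "registers", speed_priority, reg_delta))  # add
```

Keeping the current mode when the count equals the threshold follows the slide's wording: speed until fewer than Threshold, registers until more than Threshold.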

Local Scheduling & Register Allocation

List scheduling is a local, incremental algorithm
- Decisions made on an operation-by-operation basis
- Use local (basic-block level) metrics

Need a local, incremental register-allocation algorithm
- Best’s algorithm, called “bottom-up local” in EaC
  - To free a register, evict the value with furthest next use
- Uses local (basic-block level) metrics

Combining these two algorithms leads to a fair, local algorithm for the combined problem
- Idea is due to Dae-Hwan Kim & Hyuk-Jae Lee
- Can use a non-local eviction heuristic (new twist on Best’s alg.)

See Dae-Hwan Kim and Hyuk-Jae Lee, “Integrated instruction scheduling and fine-grain register allocation for embedded processors,” LNCS 4017, July 2006 (6th Int’l Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS 2006), Samos, Greece)

Original Code for Local List Scheduling

Paraphrasing from the local scheduling lecture …

    Cycle ← 1
    Ready ← leaves of D
    Active ← Ø
    while (Ready ∪ Active ≠ Ø)
        if (Ready ≠ Ø) then
            remove an op from Ready
            S(op) ← Cycle
            Active ← Active ∪ op
        Cycle ← Cycle + 1
        update the Ready queue
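For readers who want something executable, here is one way the paraphrased loop might look in Python. It is a sketch under stated assumptions: a single-issue machine, a dependence graph given as a hypothetical `preds` map from each op to the ops it depends on, per-op `latency`, and a latency-weighted critical-path priority with ties broken by list order. "Update the Ready queue" is taken to mean retiring finished ops from Active and releasing successors whose operands are all available.

```python
def list_schedule(ops, preds, latency):
    """Forward list scheduling, one op per cycle; returns {op: issue cycle}."""
    succs = {op: [] for op in ops}
    waiting = {op: len(preds.get(op, ())) for op in ops}
    for op in ops:
        for p in preds.get(op, ()):
            succs[p].append(op)

    rank = {}
    def priority(op):
        # Latency-weighted length of the longest path from op to a root of D.
        if op not in rank:
            rank[op] = latency[op] + max((priority(s) for s in succs[op]), default=0)
        return rank[op]

    cycle = 1
    ready = [op for op in ops if waiting[op] == 0]   # leaves of D
    active = []
    S = {}
    while ready or active:
        if ready:
            op = max(ready, key=priority)            # remove an op from Ready
            ready.remove(op)
            S[op] = cycle
            active.append(op)
        cycle += 1
        # Update the Ready queue: retire finished ops, release their successors.
        for op in list(active):
            if S[op] + latency[op] <= cycle:
                active.remove(op)
                for s in succs[op]:
                    waiting[s] -= 1
                    if waiting[s] == 0:
                        ready.append(s)
    return S

# Two 3-cycle loads feed an add; d is independent filler work.
latency = {"a": 3, "b": 3, "c": 1, "d": 1}
print(list_schedule(["a", "b", "c", "d"], {"c": ["a", "b"]}, latency))
# {'a': 1, 'b': 2, 'd': 3, 'c': 5}
```

Cycle 4 stays empty because the second load has not finished; that is the kind of hole in the schedule the scheduling study asks about.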

The Combined Algorithm

    Cycle ← 1
    Ready ← leaves of D
    Active ← Ø
    while (Ready ∪ Active ≠ Ø)
        if (Ready ≠ Ø) then
            remove an op from Ready
            make operands available in registers
            allocate a register for target
            S(op) ← Cycle
            Active ← Active ∪ op
        Cycle ← Cycle + 1
        update the Ready queue
    Reload Live on Exit values, if necessary

Bottom-up local:
- Keep a list of free registers
- On last use, put register back on free list
- To free register, store value used farthest in the future

Fast, simple, & effective
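The allocation half of the combined loop ("make operands available in registers", "allocate a register for target") can be sketched separately. The Python below applies the bottom-up local rules to ops in an already chosen order. It is an illustration under assumptions, not the lecture's code: each value is defined once, no value is live on exit, `k` is at least the number of operands of any op, and the name `bottom_up_local` and the `(dest, srcs)` op format are hypothetical.

```python
INF = float("inf")

def bottom_up_local(block, k):
    """block: list of (dest, srcs) in issue order; k: number of registers.
    Returns a list of ("op" | "load" | "store", value, register) actions."""
    uses = {}
    for i, (_, srcs) in enumerate(block):
        for s in srcs:
            uses.setdefault(s, []).append(i)

    def next_use(v, i):
        # Position of the first use of v strictly after op i.
        return next((p for p in uses.get(v, ()) if p > i), INF)

    free = list(range(k))          # list of free registers
    reg_of, in_memory, code = {}, set(), []

    def take_reg(i, pinned):
        if free:
            return free.pop(0)
        # Evict the value whose next use is farthest in the future.
        victim = max((v for v in reg_of if v not in pinned),
                     key=lambda v: next_use(v, i))
        r = reg_of.pop(victim)
        if next_use(victim, i) != INF and victim not in in_memory:
            code.append(("store", victim, r))
            in_memory.add(victim)
        return r

    for i, (dest, srcs) in enumerate(block):
        for s in srcs:             # make operands available in registers
            if s not in reg_of:
                reg_of[s] = take_reg(i, set(srcs))
                code.append(("load", s, reg_of[s]))
                in_memory.add(s)
        for s in dict.fromkeys(srcs):
            if next_use(s, i) == INF:          # last use: register goes back
                free.append(reg_of.pop(s))
        reg_of[dest] = take_reg(i, set())      # allocate a register for target
        code.append(("op", dest, reg_of[dest]))
    return code

block = [("a", []), ("b", []), ("c", []), ("d", ["b", "c"]), ("e", ["a", "d"])]
for action in bottom_up_local(block, k=2):
    print(action)
```

With two registers, defining `c` forces an eviction; `a` is chosen over `b` because its next use is one op farther away, so the output contains a single store and a single reload of `a`. In the combined algorithm these steps would run inside the scheduler's loop, once per op taken from Ready.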

Notes on the Final Exam

- Closed-notes, closed-book exam
- Exam available Wednesday
- Three hour time limit
  - I aimed for a two-hour exam, but I don’t want you to feel time pressure.
  - You may take one break of up to fifteen minutes.
- You are responsible for the entire course
  - Exam focuses primarily on material since the midterm
  - Chapters 5, 6, 7, 8, 9.1, 9.2, 11, 12, & 13
  - All the lecture notes
- Return the exam to DH 3080 (Penny Anderson’s office) by 5PM on the last day of exams, December 15, 2010
- If you must leave, you can email me a Word file or a PDF document.

Schielke’s RBF Algorithm for Local Scheduling

Relying on randomization & restart, we can smooth the behavior of classic list scheduling algorithms

Schielke’s RBF algorithm (Randomized Backward & Forward)
- Run 5 passes of forward list scheduling and 5 passes of backward list scheduling
- Break each tie randomly
- Keep the best schedule
  - Shortest time to completion
  - Other metrics are possible (shortest time + fewest registers)
- In practice, this approach does very well
  - Reuses the dependence graph

My “algorithm of choice” for list scheduling …
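A compact Python sketch of the RBF idea follows. It is written under assumptions rather than taken from Schielke's code: a single-issue machine, a hypothetical `deps` map from each op to the ops it depends on, and a priority equal to the longest delay-weighted path in the direction being scheduled. A backward pass is modeled by running the same list scheduler on the reversed dependence graph and then mirroring the issue cycles.

```python
import random

def list_pass(ops, edges, rng):
    """One single-issue list-scheduling pass with random tie-breaking.
    edges[u] = [(v, delay)]: v may issue no earlier than S(u) + delay."""
    waiting = {op: 0 for op in ops}
    for u in ops:
        for v, _ in edges[u]:
            waiting[v] += 1
    rank = {}
    def priority(u):
        if u not in rank:
            rank[u] = max((d + priority(v) for v, d in edges[u]), default=0)
        return rank[u]
    earliest = {op: 1 for op in ops}
    avail = [op for op in ops if waiting[op] == 0]
    S, cycle = {}, 1
    while len(S) < len(ops):
        ready = [op for op in avail if earliest[op] <= cycle]
        if ready:
            top = max(priority(op) for op in ready)
            op = rng.choice([o for o in ready if priority(o) == top])
            avail.remove(op)
            S[op] = cycle
            for v, d in edges[op]:
                earliest[v] = max(earliest[v], cycle + d)
                waiting[v] -= 1
                if waiting[v] == 0:
                    avail.append(v)
        cycle += 1
    return S

def rbf(ops, deps, latency, passes=5, seed=0):
    """Run `passes` forward and `passes` backward passes; keep the shortest."""
    rng = random.Random(seed)
    fwd = {u: [] for u in ops}
    bwd = {u: [] for u in ops}
    for s in ops:
        for p in deps.get(s, ()):
            fwd[p].append((s, latency[p]))   # p must finish before s issues
            bwd[s].append((p, latency[p]))   # same constraint, reversed graph
    best, best_len = None, None
    for edges, backward in [(fwd, False)] * passes + [(bwd, True)] * passes:
        S = list_pass(ops, edges, rng)
        if backward:                         # mirror cycles back to forward time
            last = max(S.values())
            S = {op: last + 1 - t for op, t in S.items()}
        length = max(S[op] + latency[op] - 1 for op in ops)
        if best_len is None or length < best_len:
            best, best_len = S, length
    return best, best_len
```

Both directions share one dependence description, which mirrors the slide's point that the approach reuses the dependence graph; only the edge direction and the final mirroring differ.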