Optimal Superblock Scheduling Using Enumeration Ghassan Shobaki, CS Dept. Kent Wilken, ECE Dept. University of California, Davis www.ece.ucdavis.edu/aco/

2 Outline
- Background
- Existing Solutions
- Optimal Solution
- Experimental Results
- Summary and Future Work

3 Overview
- “Instruction Scheduling is the most fundamental ILP-oriented phase.” [Josh Fisher et al., “Embedded Computing”]
- Scheduler tries to find an instruction order that minimizes pipeline stalls
- Schedule must preserve program’s semantics and honor hardware constraints

4 Elements of Instruction Scheduling
- Region Formation
- Schedule Construction (the focus of our research)

5 Region Formation
- Scheduler’s scope is a sub-graph of the program’s control flow graph (CFG)
- Local scheduling: single basic block
- Global scheduling: multiple basic blocks
  - Trace
  - Superblock and hyperblock
  - Treegion
  - General acyclic, e.g. Wavefront (2000)

6 Schedule Construction
- NP-hard problem for realistic machines
- Heuristic solutions: virtually all production compilers and most research
- Optimal approaches: recent research
  - Local: integer programming and enumeration
  - Global: integer programming

7 The Superblock
- Single-entry, multiple-exit sequence of basic blocks
- Data and control dependencies and allowed code motions are represented by a Directed Acyclic Graph (DAG)

8 Example Superblock DAG
[Figure: superblock DAG with nodes A through I; the edges are not recoverable from the transcript.]

9 List Scheduling
- Most common method in practice
- Approximate greedy algorithm that runs fast in practice
- Data-ready instructions stored in a priority list
- Priorities assigned according to heuristics
- If ready list is not empty, schedule top priority instruction
- Else schedule a stall
- Advance to next issue slot
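The loop described on this slide can be sketched in Python as follows. This is a minimal single-issue sketch, not the presenters' implementation: the DAG encoding (`succs`), the function names, and the four-instruction example DAG are all assumptions made for illustration. Priorities here are latency-weighted heights, one common reading of the critical-path heuristic.

```python
def critical_path(succs):
    """Priority = longest latency-weighted path from an instruction to any leaf."""
    memo = {}

    def height(i):
        if i not in memo:
            memo[i] = max((lat + height(s) for s, lat in succs[i]), default=0)
        return memo[i]

    return {i: height(i) for i in succs}


def list_schedule(succs, priority):
    """Greedy single-issue list scheduling.

    succs: {instr: [(successor, latency), ...]} (hypothetical encoding)
    priority: {instr: number}, higher is scheduled first
    Returns one entry per cycle: an instruction name, or None for a stall.
    """
    # Number of not-yet-scheduled predecessors of each instruction.
    pending = {i: 0 for i in succs}
    for i in succs:
        for s, _ in succs[i]:
            pending[s] += 1
    earliest = {i: 0 for i in succs}  # first cycle at which operands are available
    ready = [i for i in succs if pending[i] == 0]
    schedule = []
    remaining = len(succs)
    while remaining:
        cycle = len(schedule)
        # Data-ready: all predecessors issued and their latencies elapsed.
        candidates = [i for i in ready if earliest[i] <= cycle]
        if candidates:
            pick = min(candidates, key=lambda i: (-priority[i], i))
            ready.remove(pick)
            schedule.append(pick)
            remaining -= 1
            for s, lat in succs[pick]:
                pending[s] -= 1
                earliest[s] = max(earliest[s], cycle + lat)
                if pending[s] == 0:
                    ready.append(s)
        else:
            schedule.append(None)  # nothing is ready: schedule a stall
    return schedule


# Hypothetical DAG: A feeds B (latency 1) and C (latency 3); B and C feed D.
dag = {"A": [("B", 1), ("C", 3)], "B": [("D", 1)], "C": [("D", 1)], "D": []}
print(list_schedule(dag, critical_path(dag)))
```

In this example the scheduler stalls in cycle 2, because C must wait for A's three-cycle latency and D depends on C.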

10 Critical-Path Heuristic
[Figure: superblock DAG with nodes A through I]
Cycle  Instruction
0      A
1      B
2      G
3      C
4      D
5      H
6      E
7      F
8      I

11 Superblock Heuristics
- Critical Path
- Successive Retirement
- Dependence height and speculative yield (DHASY)
- G*
- Speculative Hedge
- Balance Scheduling

12 Optimal Scheduling
- Can make improvement over heuristics
- Accurate heuristic methods are already complex
- In some applications, longer compile times can be tolerated
- Reference for evaluating accuracy of heuristics and studying ILP limits

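13 Objective
- $S$: a given schedule
- $P_i$: probability of exit $i$
- $D_i$: delay of exit $i$ from its lower bound $L_i$
- $E$: number of side exits
- Find a schedule with minimum cost, the probability-weighted sum of exit delays taken over all exits (the $E$ side exits and the final exit): $\mathrm{Cost}(S) = \sum_i P_i D_i$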

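14 Cost Function Example: CP
[Figure: superblock DAG with nodes A through I and scheduling ranges [0,0] [6,7] [1,2] [2,3] [3,4] [3,6] [1,4] [2,5] [8,8]; the node-to-range mapping is not recoverable from the transcript.]
Cycle  Instruction
0      A
1      B
2      G
3      C
4      D
5      H
6      E
7      F
8      I
Cost = 0.3*1 + 0.2*1 + 0.5*0 = 0.5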
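The cost computation in this example can be checked with a few lines of Python. This is a sketch under stated assumptions: the exits are taken to be C, F and the final exit I, with probabilities 0.3, 0.2, 0.5 and lower bounds 2, 6, 8. These values are not all legible in the transcript; they are chosen to be consistent with the costs quoted on the slides (0.5 for the critical-path schedule, 0.2 for the optimal one). The function name and data layout are hypothetical.

```python
def schedule_cost(schedule, exits):
    """Probability-weighted sum of exit delays.

    schedule: list of instruction names, one per cycle (single-issue)
    exits: {exit_instruction: (probability, lower_bound_cycle)}
    """
    issue_cycle = {name: cycle for cycle, name in enumerate(schedule)}
    return sum(p * (issue_cycle[b] - lb) for b, (p, lb) in exits.items())


# Assumed exit probabilities and lower bounds (see the note above).
EXITS = {"C": (0.3, 2), "F": (0.2, 6), "I": (0.5, 8)}

critical_path_order = list("ABGCDHEFI")  # schedule from the Critical-Path Heuristic slide
optimal_order = list("ABCGDHEFI")        # optimal schedule reported later in the talk

print(round(schedule_cost(critical_path_order, EXITS), 3))  # 0.5
print(round(schedule_cost(optimal_order, EXITS), 3))        # 0.2
```

Moving C one cycle earlier removes its delay term, which is where the 0.3 reduction in cost comes from.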

15 Optimal Algorithm
[Flowchart boxes: Heuristic Solution; Lower Bounds; Cost = 0? (YES / NO); Fix Branches; Enumerate; Feasible? (YES / NO); Done]

16 Enumeration
- List scheduling with backtracking
- Explores one target length at a time
- A subset of instructions can be fixed
- Branch-and-bound approach with four feasibility tests (pruning techniques)
  - Node superiority
  - LB tightening
  - History-based domination
  - Relaxed scheduling
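The idea of list scheduling with backtracking against a target length can be sketched as a depth-first search. This is an illustrative sketch only: it does not implement the four feasibility tests named above, and prunes with a single slot-count check instead. The DAG encoding, function name, and example DAG are assumptions.

```python
def enumerate_schedule(succs, target_length):
    """Search for a single-issue schedule using at most target_length cycles.

    succs: {instr: [(successor, latency), ...]} (hypothetical encoding)
    Returns a list with one entry per cycle (None = stall), or None if infeasible.
    """
    preds = {i: [] for i in succs}
    for i in succs:
        for s, lat in succs[i]:
            preds[s].append((i, lat))
    issue = {}  # instruction -> cycle at which it was issued

    def is_ready(i, cycle):
        return all(p in issue and issue[p] + lat <= cycle for p, lat in preds[i])

    def search(cycle, slots):
        if len(issue) == len(succs):
            return list(slots)  # every instruction placed within the target
        remaining = len(succs) - len(issue)
        if cycle + remaining > target_length:
            return None  # too few slots left: infeasible, backtrack
        choices = [i for i in succs if i not in issue and is_ready(i, cycle)]
        for pick in choices + [None]:  # try each ready instruction, then a stall
            if pick is not None:
                issue[pick] = cycle
            slots.append(pick)
            found = search(cycle + 1, slots)
            if found is not None:
                return found
            slots.pop()  # undo the choice and try the next one
            if pick is not None:
                del issue[pick]
        return None

    return search(0, [])


# Hypothetical DAG: A feeds B (latency 1) and C (latency 3); B and C feed D.
dag = {"A": [("B", 1), ("C", 3)], "B": [("D", 1)], "C": [("D", 1)], "D": []}
print(enumerate_schedule(dag, 4))  # no schedule fits in 4 cycles
print(enumerate_schedule(dag, 5))
```

Exploring one target length at a time then amounts to calling the search with increasing targets, starting from a lower bound, until it returns a schedule.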

17 Enumeration Example
[Figure: enumeration tree over instructions I1 through I5 with target length = 4; one branch includes a stall and is labeled “Infeasible! Backtrack”.]

18 Branch Combinations & Subset Sum
- Branch Combination Problem is NP-Complete!
- Can be reduced to Subset Sum
- In practice, the number of branches and the sizes of their ranges are small
- Solved efficiently using Dynamic Programming
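For reference, the textbook dynamic program for Subset Sum looks like the sketch below. The slides do not show how the branch combination problem is encoded, so this illustrates only the general technique; the function name and the use of non-negative integers (standing in for scaled exit weights) are assumptions.

```python
def subset_sum(values, target):
    """Return a sub-list of `values` (non-negative integers) that sums to
    `target`, or None if no such sub-list exists."""
    reachable = {0: []}  # achievable sum -> one subset that achieves it
    for v in values:
        # Snapshot the current sums so that each value is used at most once.
        for s, subset in list(reachable.items()):
            t = s + v
            if t <= target and t not in reachable:
                reachable[t] = subset + [v]
    return reachable.get(target)


print(subset_sum([3, 5, 2], 7))  # [5, 2]
print(subset_sum([3, 5, 2], 4))  # None
```

The table holds at most `target + 1` entries, so the running time is proportional to the number of values times the target. That is why small branch counts and small ranges keep the problem tractable in practice.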

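19 Complete Example
[Figure: superblock DAG with nodes A through I and scheduling ranges [0,0] [6,7] [1,2] [2,3] [3,4] [3,6] [1,4] [2,5] [8,8]; the node-to-range mapping is not recoverable from the transcript.]
- Start with CP heuristic: Cost = 0.5
- Only length 8 is interesting
Branch Comb (C, F)  Cost
(0, 0)              0.0
(0, 1)              0.2
(1, 0)              0.3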

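20 Branch Combination (0,0), Cost = 0.0: Infeasible
[Figure: superblock DAG with nodes A through I and tightened ranges [0,0] [6,6] [1,1] [2,2] [3,3] [3,5] [1,4] [2,5] [8,8]; the node-to-range mapping is not recoverable from the transcript.]
Relaxed schedule:
Cycle  Instruction
0      A
1      B
2      C
3      D
4      G
5      E
H is marked with an X in the relaxed schedule.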

21 Branch Combination (0,1), Cost = 0.2
[Figure: enumeration tree and superblock DAG with nodes A through I and ranges [0,0] [7,7] [1,1] [2,2] [3,4] [3,6] [1,4] [2,5] [8,8]; the node-to-range mapping is not recoverable from the transcript.]
Optimal schedule: A, B, C, G, D, H, E, F, I with cost 0.2

22 Experimental Results
- Superblocks imported from GCC using SPEC CPU2000, FP and INT
- Scheduled for 4 machine models:
  - single-issue
  - dual-issue
  - quad-issue
  - six-issue
- Time limit set to 1 second per problem

23 Superblock Statistics
[Table: columns FP2000 (Max, Avg) and INT2000 (Max, Avg); rows DAG Size, Exit Count, Final-Exit Probability (%), Side-Exit; the values are not preserved in the transcript.]

24 INT2000 Results
[Table: columns Issue Rate 1, 2, 4, 6, Avg; rows Hard Blocks, % Timeouts, Avg Soln Time (ms), % Improved Blocks, % Cycle Improvement; the values are not preserved in the transcript.]

25 Summary & Future Work
- An optimal superblock scheduling technique has been developed
- About 99% of hard problems solved within 1 sec
- 80% improved
- Next goal: explore other global regions. Trace is strongest candidate