Optimal Instruction Scheduling for Multi-Issue Processors using Constraint Programming. Abid M. Malik and Peter van Beek, David R. Cheriton School of Computer Science, University of Waterloo.


Optimal Instruction Scheduling for Multi-Issue Processors using Constraint Programming Abid M. Malik and Peter van Beek David R. Cheriton School of Computer Science University of Waterloo

2 Introduction
Instruction scheduling is done in the back end of a compiler.
Instruction scheduling is important to maximize instruction-level parallelism (ILP).
The instruction scheduler tries to find an instruction order that minimizes execution time.
The instruction scheduler must preserve the program's semantics and honor hardware constraints.

3 Types of instruction scheduling
The scheduler's scope is a sub-graph of the program's control flow graph (CFG).
Local scheduling: a single basic block.
Global scheduling: multiple basic blocks: trace, superblock, hyperblock, treegion.

4 The superblock
A single-entry, multiple-exit sequence of basic blocks.
Each exit node has a weight, known as its exit probability.
Data and control dependencies and allowed code motions are represented by a directed acyclic graph (DAG).
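As a minimal sketch of this representation (the instruction names, latencies, and probabilities below are illustrative, not taken from the paper), a superblock can be held as a dependency DAG plus a table of exit probabilities:

```python
# A superblock as a DAG (instruction -> list of (successor, latency) edges)
# plus its exit instructions mapped to exit probabilities.
# All names, latencies, and probabilities here are illustrative.
superblock = {
    "dag": {
        "A": [("B", 1), ("C", 2)],
        "B": [("exit1", 1)],
        "C": [("exit2", 1)],
        "exit1": [],
        "exit2": [],
    },
    "exit_probs": {"exit1": 0.4, "exit2": 0.6},
}

# The exit probabilities over all exits of the superblock sum to 1.
total = sum(superblock["exit_probs"].values())
```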

5 Example of a DAG (figure: a dependency DAG over instructions A through I)

6 Cost function for instruction scheduling
Schedule length is the cost function for basic blocks.
Weighted completion time (W_ct) is the cost function for superblocks.
For a superblock consisting of three basic blocks B1, B2 and B3, with exit probabilities w1, w2, w3 and completion times b1, b2, b3:
W_ct = w1·b1 + w2·b2 + w3·b3
In general, W_ct = Σ_{i=1}^{n} w_i·b_i
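The weighted completion time is straightforward to compute; a small sketch with hypothetical exit probabilities and completion cycles:

```python
# W_ct = sum over exits i of w_i * b_i, where w_i is exit i's probability
# and b_i is the cycle in which exit i's block completes in the schedule.
def weighted_completion_time(exits):
    """exits: iterable of (exit_probability, completion_cycle) pairs."""
    return sum(w * b for w, b in exits)

# Hypothetical superblock with three exits (probabilities sum to 1):
wct = weighted_completion_time([(0.2, 4), (0.3, 7), (0.5, 10)])
# 0.2*4 + 0.3*7 + 0.5*10 = 7.9
```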

7 Previous work
An NP-hard problem.
Heuristic solutions.
Optimal approaches:
local: integer programming, enumeration, and constraint programming; Heffernan and Wilken [2005].
global: integer programming; enumeration using dynamic programming by Shobaki and Wilken [2004].

8 List scheduling
The most common method in practice.
An approximate, greedy algorithm that runs fast in practice.
Data-ready instructions are stored in a priority list.
Priorities are assigned according to heuristics.
If the ready list is not empty, schedule the top-priority instruction; else schedule a stall.
Advance to the next issue slot.
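The steps above can be sketched as a minimal single-issue list scheduler (the DAG, latencies, and priority values are illustrative; a real scheduler would also model the issue width and functional units):

```python
# List scheduling sketch: dag maps instruction -> [(successor, latency)],
# priority maps instruction -> heuristic rank (larger = schedule earlier).
def list_schedule(dag, priority):
    preds = {v: 0 for v in dag}            # unscheduled-predecessor counts
    for succs in dag.values():
        for s, _ in succs:
            preds[s] += 1
    earliest = {v: 0 for v in dag}         # cycle when operands are ready
    unscheduled, schedule, cycle = set(dag), {}, 0
    while unscheduled:
        # Data-ready: all predecessors issued and their latencies elapsed.
        ready = [v for v in unscheduled
                 if preds[v] == 0 and earliest[v] <= cycle]
        if ready:
            v = max(ready, key=priority.get)    # top-priority instruction
            schedule[v] = cycle
            unscheduled.remove(v)
            for s, lat in dag[v]:
                preds[s] -= 1
                earliest[s] = max(earliest[s], cycle + lat)
        # else: no instruction is data-ready, so this cycle is a stall.
        cycle += 1                              # advance to next issue slot
    return schedule

dag = {"A": [("C", 2)], "B": [("C", 1)], "C": []}
priority = {"A": 3, "B": 1, "C": 0}             # e.g. critical-path heights
sched = list_schedule(dag, priority)
# A issues at cycle 0, B at cycle 1, C at cycle 2 once latencies elapse
```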

9 Heuristics in list scheduling
Basic block: critical path.
Superblock: critical path, successive retirement, dependence height and speculative yield (DHASY), G*, speculative hedge, balance scheduling.

10 Constraint programming (CP) methodology
We give a CP model which is fast and optimal for almost all basic blocks and superblocks from the SPEC 2000 benchmark.
CP model:
define the constraint model: variables, domains, constraints.
add redundant constraints to reduce the search space.
Solve the model: backtracking search along with constraint propagation.

11 Constraint model example
variables: A, B, C, D, E, F, G
domains: {1, …, m}
basic constraints:
dependency (latency) constraints: D ≥ A + 1, D ≥ B + 1, D ≥ C + 1, F ≥ E + 2, G ≥ F + 1, G ≥ D + 1
resource constraint: gcc(A, B, C, D, E, F, G, issue width), a global cardinality constraint limiting how many variables take the same value (cycle).
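The slide's model can be checked with a brute-force search over the domain {1..m}; this toy enumerator (not the paper's propagation-based solver) treats the gcc resource constraint as a cap on how many variables share a cycle, and assumes a dual-issue processor for illustration:

```python
from itertools import product

# Variables A..G are issue cycles indexed 0..6; (i, j, l) encodes the
# dependency constraint cycles[j] >= cycles[i] + l; issue_width caps how
# many instructions may share a cycle (what the gcc constraint enforces).
def solve(n_vars, m, deps, issue_width):
    best = None
    for cycles in product(range(1, m + 1), repeat=n_vars):
        if any(cycles[j] < cycles[i] + l for i, j, l in deps):
            continue
        if any(cycles.count(c) > issue_width for c in set(cycles)):
            continue
        if best is None or max(cycles) < max(best):
            best = cycles
    return best

# The slide's dependencies: D>=A+1, D>=B+1, D>=C+1, F>=E+2, G>=F+1, G>=D+1
deps = [(0, 3, 1), (1, 3, 1), (2, 3, 1), (4, 5, 2), (5, 6, 1), (3, 6, 1)]
best = solve(7, 4, deps, issue_width=2)   # assumed dual-issue, m = 4
# the chain E -> F -> G alone forces a makespan of at least 4 cycles
```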

12 CP model for instruction scheduling
Six main types of constraint in the CP model for basic block and superblock scheduling:
latency constraints
resource constraints
distance constraints
predecessor constraints
successor constraints
dominance constraints

13 Experimental results (basic block)

14 Experimental results (basic block): optimal vs. critical path

15 Experimental results (superblock)

16 Experimental results (superblock): optimal scheduler vs. heuristics

17 Experimental results (superblock): optimal scheduler vs. heuristics

18 Comparison with the work of Heffernan [2005] and Shobaki [2004]
The CP optimal scheduler is more robust and scales better on large problems.
The CP optimal scheduler is able to solve harder problems: our test suite contains larger and more varied latencies (the earlier test suites used shorter latencies) and larger basic blocks and superblocks.

19 Conclusions
A CP approach to basic block and superblock instruction scheduling:
multi-issue processors
arbitrary latencies
Optimal and fast on very large, real problems.
The key was an improved constraint model.

20 Future work
Using CP to find an optimal schedule for a basic block for a given register pressure without spilling.
Using CP for the combined instruction scheduling and register allocation problem.

21 Work in progress
Optimal basic block and superblock instruction scheduling for realistic architectures, Mike [2006].

22 Acknowledgements
IBM CAS Toronto Lab.
Jim McInnes from the IBM Toronto Lab.
Tyrell Russell and Michael Chase from the University of Waterloo.

23 Thank You Questions!!!