Constraint Programming for Compiler Optimization March 2006.


2 Acknowledgements
Joint work with: Alexander Golynski, Alejandro López-Ortiz, Abid Malik, Jim McInnes, Claude-Guy Quimper, John Tromp, Kent Wilken
Funding: NSERC, IBM Canada

3 Optimization problems in compilers
Instruction selection
Instruction scheduling
·basic-block instruction scheduling
·super-block scheduling
·software pipelining & loop unrolling
Register allocation
Memory hierarchy optimizations

4 Basic-block instruction scheduling
Schedule basic-block
·straight-line sequence of code with single entry, single exit
Multiple-issue pipelined processors
·multiple instructions can begin execution each clock cycle
·delay or latency before results are available
Find minimum length schedule
Classic problem
·lots of attention in literature

5 Example: (a + b) + c
Instructions:
A  r1 ← a
B  r2 ← b
C  r3 ← c
D  r1 ← r1 + r2
E  r1 ← r1 + r3
Dependency DAG: A → D, B → D, C → E, D → E

6 Single-issue pipelined processor: non-optimal schedule
A  r1 ← a
B  r2 ← b
nop
D  r1 ← r1 + r2
C  r3 ← c
nop
E  r1 ← r1 + r3
[Figure: dependency DAG over A, B, C, D, E]

7 Single-issue pipelined processor: optimal schedule
A  r1 ← a
B  r2 ← b
C  r3 ← c
nop
D  r1 ← r1 + r2
E  r1 ← r1 + r3
[Figure: dependency DAG over A, B, C, D, E]
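
The difference between the two schedules can be checked mechanically. The sketch below is illustrative only: the latencies are taken from the constraint model on slide 14 (3 cycles into D from A and B, 3 from C into E, 1 from D into E), and the issue cycles for the non-optimal order A, B, D, C, E are reconstructed under those latencies, since the slide text does not state them. The names `LATENCY`, `is_valid_single_issue`, and `length` are hypothetical.

```python
# Latencies (producer, consumer) -> cycles; assumed from the slide-14 model.
LATENCY = {("A", "D"): 3, ("B", "D"): 3, ("C", "E"): 3, ("D", "E"): 1}


def is_valid_single_issue(schedule):
    """schedule maps each instruction to its 1-based issue cycle."""
    cycles = list(schedule.values())
    # Single issue: no two instructions may share a cycle.
    if len(set(cycles)) != len(cycles):
        return False
    # Every consumer must wait for the latency of each producer.
    return all(schedule[v] >= schedule[u] + lat
               for (u, v), lat in LATENCY.items())


def length(schedule):
    """Schedule length = cycle in which the last instruction issues."""
    return max(schedule.values())


# Order A, B, D, C, E with the idle cycles the latencies force (assumed).
non_optimal = {"A": 1, "B": 2, "D": 5, "C": 6, "E": 9}
# Order A, B, C, D, E: issuing C early hides part of the latency.
optimal = {"A": 1, "B": 2, "C": 3, "D": 5, "E": 6}

if __name__ == "__main__":
    for name, s in [("non-optimal", non_optimal), ("optimal", optimal)]:
        print(name, is_valid_single_issue(s), length(s))
```

Under these assumptions both schedules are valid, but moving C ahead of D shortens the schedule because C fills a cycle that would otherwise be a nop.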

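8 Multiple-issue pipelined processor
[Figure: dependency DAG over A, B, C, D, E and a schedule of A, B, C, D, E for a fixed issue width]

9 Multiple-issue pipelined processor
[Figure: dependency DAG over A, B, C, D, E and a schedule of A, B, C, D, E for a fixed issue width; the value 6 appears in the figure]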

10 Production compilers “At the outset, note that basic-block scheduling is an NP-hard problem, even with a very simple formulation of the problem, so we must seek an effective heuristic, rather than exact, approach.” Steven Muchnick, Advanced Compiler Design & Implementation, 1997

11 Optimal approaches: state of the art
Single-issue
Previous
·10-40 instructions: ILP (Arya, 1985), CP (Ertl & Krall, 1991)
·up to 1000 instructions: ILP (Wilken et al., 2000)
Our work
·up to 2600 instructions
·20 × faster
Multiple-issue
Previous
·10-40 instructions: ILP (Chang et al., 1997), DP (Kessler, 1998)
·up to 1000 instructions: B&B (Heffernan et al., 2005)
Our work
·up to 2600 instructions
·50-fold improvement

12 Constraint programming methodology Model problem ·specify in terms of constraints on acceptable solutions ·define/choose constraint model: variables, domains, constraints Solve model ·define/choose search algorithm ·define/choose heuristics

14 Minimal constraint model
variables: A, B, C, D, E
domains: {1, …, m}
constraints:
D ≥ A + 3
D ≥ B + 3
E ≥ C + 3
E ≥ D + 1
gcc(A, B, C, D, E, width)
[Figure: dependency DAG over A, B, C, D, E]
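
To make the model concrete, here is a minimal sketch that enumerates all assignments by brute force instead of using a constraint solver. It is not the method of the talk: the gcc constraint is approximated as "at most `width` instructions per cycle", and the names `satisfies`, `solutions`, and `min_length` are hypothetical.

```python
from collections import Counter
from itertools import product

VARS = ["A", "B", "C", "D", "E"]
# (producer, consumer, latency) triples, i.e. consumer >= producer + latency.
LATENCY = [("A", "D", 3), ("B", "D", 3), ("C", "E", 3), ("D", "E", 1)]


def satisfies(assign, width):
    """Check the latency constraints and the per-cycle issue limit."""
    if any(assign[v] < assign[u] + lat for u, v, lat in LATENCY):
        return False
    # gcc approximated here: no cycle is used more than `width` times.
    return max(Counter(assign.values()).values()) <= width


def solutions(m, width):
    """Yield every assignment over domains {1, ..., m} that satisfies the model."""
    for values in product(range(1, m + 1), repeat=len(VARS)):
        assign = dict(zip(VARS, values))
        if satisfies(assign, width):
            yield assign


def min_length(width, max_m=10):
    """Smallest m for which the model has a solution (None if above max_m)."""
    for m in range(1, max_m + 1):
        if next(solutions(m, width), None) is not None:
            return m
    return None


if __name__ == "__main__":
    print(min_length(width=1), min_length(width=2))
```

Enumeration is exponential in the number of instructions, which is why the talk relies on propagation and search instead.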

15 Bounds consistency constraint propagation
Initial domain of A, B, C, D, E: [1, 6]
Constraints: D ≥ A + 3, D ≥ B + 3, E ≥ C + 3, E ≥ D + 1, gcc(A, B, C, D, E, 1)
Domain updates:
A: [1, 6] → [1, 3] → [1, 2]
B: [1, 6] → [1, 3] → [1, 2]
C: [1, 6] → [1, 3] → [3, 3]
D: [1, 6] → [4, 6] → [4, 5]
E: [1, 6] → [4, 6] → [5, 6] → [6, 6]
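
The propagation on this slide can be sketched in a few lines. This is an illustrative, naive version, not the propagators from the talk: latency constraints tighten interval bounds, and gcc with width 1 is treated as all-diff and pruned with a simple Hall-interval rule. The names `propagate_latency`, `propagate_alldiff`, and `bounds_consistency` are hypothetical.

```python
LATENCY = [("A", "D", 3), ("B", "D", 3), ("C", "E", 3), ("D", "E", 1)]


class Inconsistent(Exception):
    """Raised when a domain becomes empty or a Hall interval overflows."""


def propagate_latency(dom):
    """One pass of v >= u + lat over interval domains; True if anything changed."""
    changed = False
    for u, v, lat in LATENCY:
        (lo_u, hi_u), (lo_v, hi_v) = dom[u], dom[v]
        new_u = (lo_u, min(hi_u, hi_v - lat))
        new_v = (max(lo_v, lo_u + lat), hi_v)
        if new_u[0] > new_u[1] or new_v[0] > new_v[1]:
            raise Inconsistent
        if (new_u, new_v) != (dom[u], dom[v]):
            dom[u], dom[v] = new_u, new_v
            changed = True
    return changed


def propagate_alldiff(dom):
    """Naive Hall-interval pruning for all-diff (issue width 1)."""
    changed = False
    lows = sorted({lo for lo, _ in dom.values()})
    highs = sorted({hi for _, hi in dom.values()})
    for a in lows:
        for b in highs:
            if b < a:
                continue
            inside = {x for x, (lo, hi) in dom.items() if a <= lo and hi <= b}
            if len(inside) > b - a + 1:
                raise Inconsistent          # more variables than cycles
            if len(inside) < b - a + 1:
                continue                    # [a, b] is not a Hall interval
            for x in dom:
                if x in inside:
                    continue
                lo, hi = dom[x]
                new = (b + 1 if a <= lo <= b else lo,
                       a - 1 if a <= hi <= b else hi)
                if new[0] > new[1]:
                    raise Inconsistent
                if new != dom[x]:
                    dom[x] = new
                    changed = True
    return changed


def bounds_consistency(m):
    """Propagate to a fixpoint from domains [1, m]; None if inconsistent."""
    dom = {x: (1, m) for x in "ABCDE"}
    try:
        while True:
            changed = propagate_latency(dom)
            changed = propagate_alldiff(dom) or changed
            if not changed:
                return dom
    except Inconsistent:
        return None


if __name__ == "__main__":
    print(bounds_consistency(6))
```

With m = 6 the fixpoint matches the final domains on the slide; with m = 5 the three variables A, B, C are squeezed into two cycles and the sketch reports inconsistency.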

16 Improvements to constraint model
1. Distance constraints: constraints over nodes which define regions
2. Predecessor and successor constraints: constraints over nodes with multiple predecessors or multiple successors
3. Safe pruning constraint: global constraint
4. Dominance constraints: constraints based on graph isomorphism

18 Distance constraints: Regions
A pair of nodes i, j defines a region in a DAG G if:
(i) there is more than one path from i to j, and
(ii) not all paths from i to j go through some node k distinct from i and j.
[Figure: region between nodes i and j]
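
The definition can be tested directly on small graphs. The sketch below enumerates every path from i to j, which is exponential in general and meant only to illustrate the two conditions; the example DAG and the names `all_paths` and `defines_region` are hypothetical.

```python
def all_paths(succ, i, j):
    """All paths from i to j in a DAG given as a successor map."""
    if i == j:
        return [[j]]
    return [[i] + rest
            for n in succ.get(i, ())
            for rest in all_paths(succ, n, j)]


def defines_region(succ, i, j):
    """True if (i, j) satisfies conditions (i) and (ii) of the definition."""
    paths = all_paths(succ, i, j)
    if len(paths) < 2:                       # condition (i) fails
        return False
    # Interior nodes shared by every path; any such node k violates (ii).
    common = set.intersection(*(set(p[1:-1]) for p in paths))
    return not common


# Hypothetical DAG: two branches from i rejoin at k, then k leads to j.
succ = {"i": ["a", "b"], "a": ["k"], "b": ["k"], "k": ["j"]}

if __name__ == "__main__":
    print(defines_region(succ, "i", "k"))    # two disjoint branches
    print(defines_region(succ, "i", "j"))    # every path passes through k
```

In the example, (i, k) defines a region, while (i, j) does not because every path goes through k.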

19 Distance constraints: Estimate
[Figure: DAG over nodes A, B, C, D, E, F, G, H]

20 Distance constraints: Estimate
[Figure: same DAG; timeline j, j+1, …, j+5 between nodes A and F; value 5]

21 Distance constraints: Estimate
[Figure: same DAG; timeline j, j+1, …, j+5 between nodes E and H; value 5]

22 Distance constraints: Estimate
[Figure: same DAG; timeline j, j+1, …, j+9 between nodes A and H]

23 Distance constraints: Optimal
Estimate: H ≥ A + 9
Not optimal: A = 1, H = 10
·propagate latency
·propagate all-diff
[Figure: DAG over nodes A, B, C, D, E, F, G, H annotated with domains [1,1], [10,10], [2,3], [5,6], [6,7], [2,3]]

24 Distance constraints: Optimal
Estimate: H ≥ A + 9
Not optimal: A = 1, H = 10
·propagate latency
·propagate all-diff: inconsistent
Optimal: H ≥ A + 10
[Figure: DAG over nodes A, B, C, D, E, F, G, H annotated with domains [1,1], [10,10], [2,3], [5,6], [6,7], [2,3]]

25 Improvements to constraint model
1. Distance constraints: constraints over nodes which define regions
2. Predecessor and successor constraints: constraints over nodes with multiple predecessors or multiple successors
3. Safe pruning constraint: global constraint
4. Dominance constraints: constraints based on graph isomorphism

26 Predecessor constraints
[Figure: DAG over nodes A, B, C, D, E, F, G, H with edge latencies and node domains [4, ], [ ,14], [5,9], [8,12], [9,12], [5,9], [6,9], [5,8]]

27 Predecessor constraints
[Figure: same DAG and domains; cycles 5 to 9 shown for the predecessors; updated domain: [9,12]]

28 Predecessor constraints
[Figure: same DAG and domains; updated domains: [9,12], [12,14]]

29 Successor constraints
[Figure: same DAG and domains; cycles 6 to 9 shown for the successors; updated domains: [9,12], [12,14], [4,6]]

30 Constraint programming methodology Model problem ·specify in terms of constraints on acceptable solutions ·define/choose constraint model: variables, domains, constraints Solve model ·define/choose search algorithm ·define/choose heuristics

31 Solving instances of the model
Use constraints to establish:
·lower bound on length m of optimal schedule
·min and max of domains of variables
Backtracking search
·branches on min(x), min(x)+1, …
·interleave with bounds consistency constraint propagation
·fallback: singleton consistency on bounds
If no solution found, increment m and repeat search
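
The outer loop described on this slide can be sketched as follows. This is a simplified illustration, not the solver from the talk: propagation covers only the latency constraints, the issue-width limit is checked on already-fixed variables rather than with a gcc propagator, and the singleton-consistency fallback is omitted. The names `propagate`, `search`, and `schedule` are hypothetical.

```python
from collections import Counter

VARS = ["A", "B", "C", "D", "E"]
LATENCY = [("A", "D", 3), ("B", "D", 3), ("C", "E", 3), ("D", "E", 1)]


def propagate(dom):
    """Latency bounds propagation to a fixpoint; False if a domain empties."""
    changed = True
    while changed:
        changed = False
        for u, v, lat in LATENCY:
            (lo_u, hi_u), (lo_v, hi_v) = dom[u], dom[v]
            new_u = (lo_u, min(hi_u, hi_v - lat))
            new_v = (max(lo_v, lo_u + lat), hi_v)
            if new_u[0] > new_u[1] or new_v[0] > new_v[1]:
                return False
            if (new_u, new_v) != (dom[u], dom[v]):
                dom[u], dom[v] = new_u, new_v
                changed = True
    return True


def search(dom, width):
    """Backtracking search branching on min(x), min(x)+1, ..."""
    if not propagate(dom):
        return None
    # Weak stand-in for gcc: fixed variables must respect the issue width.
    fixed = Counter(lo for lo, hi in dom.values() if lo == hi)
    if fixed and max(fixed.values()) > width:
        return None
    unfixed = [x for x in VARS if dom[x][0] < dom[x][1]]
    if not unfixed:
        return {x: dom[x][0] for x in VARS}
    x = unfixed[0]
    lo, hi = dom[x]
    for value in range(lo, hi + 1):
        child = dict(dom)
        child[x] = (value, value)
        result = search(child, width)
        if result is not None:
            return result
    return None


def schedule(width):
    """Start from a lower bound on m and increment until a schedule exists."""
    n = len(VARS)
    horizon = n + sum(lat for _, _, lat in LATENCY)   # loose upper limit
    probe = {x: (1, horizon) for x in VARS}
    propagate(probe)
    critical_path = max(lo for lo, _ in probe.values())
    m = max(critical_path, -(-n // width))            # ceil(n / width)
    while m <= horizon:
        result = search({x: (1, m) for x in VARS}, width)
        if result is not None:
            return m, result
        m += 1
    return None


if __name__ == "__main__":
    print(schedule(1))
    print(schedule(2))
```

Because m starts at a lower bound and increases by one each time the search fails, the first schedule found has minimum length.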

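32 Solving instances of the model
[Figure: dependency DAG over A, B, C, D, E and schedule grid; initial domains [1,5]]

33 Solving instances of the model
[Figure: dependency DAG over A, B, C, D, E and schedule grid; domains empty, [ ]]

34 Solving instances of the model
[Figure: dependency DAG over A, B, C, D, E and schedule grid; initial domains [1,6]]

35 Solving instances of the model
[Figure: dependency DAG over A, B, C, D, E and schedule grid; domains [1,2], [5,5], [3,3], [6,6]]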

36 Improvements to constraint solver
Design special purpose constraint propagators
·commonly occurring constraints
·significantly improve efficiency
Improved algorithms for bounds consistency
·all-diff constraint
·gcc constraint

37 Comparing all-diff propagators (prototype) Time (sec.) to solve instruction scheduling problems; model includes latency, distance, and all-diff constraints. DC: Régin, 1994; MT: Mehlhorn & Thiel, 2000; BC: IJCAI-2003

38 Comparing gcc propagators (prototype) Time (sec.) to solve instruction scheduling problems; model includes latency and gcc constraints; width is 2. DC: Régin, 1996; vH: van Hentenryck et al., 1992; BC: CP-2003

39 Putting it all together: Experimental results SPEC 2000 & MediaBench Benchmarks Total of 352,111 basic blocks of size 3 or greater Improved = improved schedule over heuristic scheduler Timed out = not solved within 10 minutes

40 Putting it all together: Experimental results SPEC 2000 & MediaBench Benchmarks For basic blocks with improved schedules

41 Conclusions
CP approach to instruction scheduling
·Single-issue processors: 20 times faster than previous best optimal approach
·Multiple-issue processors: larger and more difficult problems; 50-fold reduction in number of problems that cannot be solved
Constraint propagators
·faster all-diff and gcc constraint propagators
·useful in many problems

42 Current and future work: Expand scope of problem Instruction selection Instruction scheduling ·basic-block instruction scheduling ·super-block scheduling ·software pipelining & loop unrolling Register allocation Memory hierarchy optimizations