CR18: Advanced Compilers
L04: Scheduling
Tomofumi Yuki

Today’s Agenda
- Revisiting legality with schedules
- How to find schedules

Schedules
Recall that we had many kinds of "schedules"; here, we use the one related to time.
In general, a schedule is a function such that:
- input: a statement instance
- output: a timestamp
Instances mapped to the same timestamp "may happen in parallel".
We talk about static schedules in this class.
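In the notation used on the following slides, this can be written compactly (a sketch; taking integer timestamps is one common choice for one-dimensional static schedules):

```latex
\theta_S : D_S \to \mathbb{Z}, \qquad
\theta_S(x) = \theta_T(y) \;\Rightarrow\; x \text{ and } y \text{ may happen in parallel}
```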

Legality with Schedule
Causality condition: given a PRDG with nodes N and edges E,
- src(e) = producer statement
- dst(e) = consumer statement
- D_S = domain of statement node S
- D_e = domain of dependence e
Check: for every edge e in E and every pair (x,y) in D_e, θ_dst(e)(y) > θ_src(e)(x).

Example (uniform case)
Back to the legality check with vectors; the dependence vector is [1,-1].

for (i=1; i<N; i++)
  for (j=1; j<M; j++)
S:  A[i][j] = A[i-1][j+1] + B[i][j];

Candidate schedule: θ_s(i,j) = i
Dependence e: (i,j -> i+1,j-1)
Check: θ_s(i+1,j-1) > θ_s(i,j), i.e., i+1 > i, which always holds: legal.

Example (uniform case)
Same program, same dependence vector [1,-1].

for (i=1; i<N; i++)
  for (j=1; j<M; j++)
S:  A[i][j] = A[i-1][j+1] + B[i][j];

Candidate schedule: θ_s(i,j) = j
Dependence e: (i,j -> i+1,j-1)
Check: θ_s(i+1,j-1) > θ_s(i,j), i.e., j-1 > j, which never holds: illegal.

Example (uniform case)
Same program, same dependence vector [1,-1].

for (i=1; i<N; i++)
  for (j=1; j<M; j++)
S:  A[i][j] = A[i-1][j+1] + B[i][j];

Candidate schedule: θ_s(i,j) = i-j
Dependence e: (i,j -> i+1,j-1)
Check: θ_s(i+1,j-1) > θ_s(i,j), i.e., i-j+2 > i-j, which always holds: legal.
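The causality check above is mechanical, so it can be replayed by brute force. Below is a minimal sketch (not from the slides) that enumerates all instances of the dependence e: (i,j -> i+1,j-1) and tests the three candidate schedules from the last three slides; the bounds N = M = 10 are an arbitrary choice for illustration.

```c
#include <stdio.h>

/* The three candidate schedules from the preceding slides. */
int theta_i(int i, int j)   { return i; }
int theta_j(int i, int j)   { return j; }
int theta_imj(int i, int j) { return i - j; }

/* Brute-force causality check for the dependence e: (i,j -> i+1,j-1)
   over the domain 1 <= i < N, 1 <= j < M: every consumer instance
   must get a strictly later timestamp than its producer. */
int legal(int (*theta)(int, int), int N, int M) {
  for (int i = 1; i < N; i++)
    for (int j = 1; j < M; j++)
      if (i + 1 < N && j - 1 >= 1)              /* consumer in domain */
        if (theta(i + 1, j - 1) <= theta(i, j))
          return 0;                             /* causality violated */
  return 1;
}

int main(void) {
  printf("theta=i   : %s\n", legal(theta_i,   10, 10) ? "legal" : "illegal");
  printf("theta=j   : %s\n", legal(theta_j,   10, 10) ? "legal" : "illegal");
  printf("theta=i-j : %s\n", legal(theta_imj, 10, 10) ? "legal" : "illegal");
  return 0;
}
```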

Example (affine case)
Back to the legality check with vectors; the dependence vectors are [0,1] and [1,*].

for (i=1; i<N; i++)
  for (j=1; j<M; j++)
S:  A[i][j] = A[i][j-1] + A[i-1][M-j];

Candidate schedule: θ_s(i,j) = i+j
Dependence e: (i,j -> i+1,M-j)
Check: θ_s(i+1,M-j) > θ_s(i,j), i.e., M+i-j+1 > i+j, i.e., (M+1)/2 > j.
This only holds for j < (M+1)/2, so θ_s = i+j is not legal over the whole domain.

The Scheduling Problem
Find θs that satisfy the causality conditions, i.e., such that no dependence is violated.
Connection to loops: you can complete the schedule to obtain the loop transformation.
Sometimes the problem is formulated in terms of the transformation instead of the schedule.

Parallel Execution of DO Loops [Lamport 74]
One of the first papers on automatic parallelization: the hyper-plane method.
Scope of dependences: uniform + α (slightly more general than uniform).
Loops of the form

for I_1 = l_1 .. u_1
  ...
  for I_n = l_n .. u_n
    body

are transformed into

for J_1 = λ_1 .. μ_1
  ...
  for J_k = λ_k .. μ_k
    forall J_{k+1} = λ_{k+1} .. μ_{k+1}
      ...
      forall J_n = λ_n .. μ_n
        body

The Hyper-Plane Method
The main theorem (simplified): we look for a schedule θ such that the inner n-1 loops are parallel.
θ is restricted to linear functions: θ = a_1 I_1 + ... + a_n I_n
The key idea: given a distance vector c, we want θ(c) > 0.
A proof that such a θ exists for every lexicographically positive c is in the paper.

The Hyper-Plane Method
Optimizing the schedule: what should be the objective function?
In this paper, it is min(μ_1 - λ_1), the trip count of the one remaining sequential loop,
which amounts to minimizing θ'(u - l) over the vectors u, l of loop bounds, where
θ'(x) = |a_1|x_1 + ... + |a_n|x_n

for I_1 = l_1 .. u_1
  ...
  for I_n = l_n .. u_n
    body

becomes

for J_1 = λ_1 .. μ_1
  forall J_2 = λ_2 .. μ_2
    ...
    forall J_n = λ_n .. μ_n
      body

Example 1
With distance vectors [1,0] and [0,1], and θ(i,j) = ai + bj.
Constraints:
- θ([1,0]) > 0 : a > 0
- θ([0,1]) > 0 : b > 0
Minimize aN + bM over the domain 0≤i<N, 0≤j<M.
The smallest integer solution is a = b = 1, i.e., θ(i,j) = i + j.

Example 2
With distance vectors [1,-1] and [0,1], and θ(i,j) = ai + bj.
Constraints:
- θ([1,-1]) > 0 : a > b
- θ([0,1]) > 0 : b > 0
Minimize aN + bM over the domain 0≤i<N, 0≤j<M.
The smallest integer solution is a = 2, b = 1, i.e., θ(i,j) = 2i + j.
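For distance vectors, the search for good coefficients can even be done exhaustively over a small range. The sketch below (my own illustration, not Lamport's algorithm) finds the (a,b) that satisfies θ(c) > 0 for the distance vectors of Example 2 while minimizing aN + bM; the coefficient bound of 10 and N = M = 100 are arbitrary choices.

```c
#include <stdio.h>
#include <limits.h>

int main(void) {
  const int N = 100, M = 100;
  /* Distance vectors of Example 2. */
  const int dist[2][2] = { {1, -1}, {0, 1} };
  int best_a = 0, best_b = 0, best_cost = INT_MAX;

  /* Enumerate candidate coefficients for theta(i,j) = a*i + b*j. */
  for (int a = -10; a <= 10; a++)
    for (int b = -10; b <= 10; b++) {
      int ok = 1;
      for (int k = 0; k < 2; k++)
        if (a * dist[k][0] + b * dist[k][1] <= 0)
          ok = 0;                     /* theta(c) > 0 fails for vector k */
      int cost = a * N + b * M;       /* latency bound to minimize */
      if (ok && cost < best_cost) {
        best_cost = cost; best_a = a; best_b = b;
      }
    }
  /* Prints theta(i,j) = 2*i + 1*j for these vectors. */
  printf("theta(i,j) = %d*i + %d*j (cost %d)\n", best_a, best_b, best_cost);
  return 0;
}
```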

The General-Plane Method
Generalizes the hyper-plane method to the case where the dependences are no longer uniform.
Given the iteration vector x, the hyper-plane method handles array accesses of the form
VAR[p(x)+c], where p is a permutation common to the entire loop body.
The general-plane method extends this to VAR[d(p(x)+c)], where d "drops" some number of dimensions.

Final Words on this Paper
A very early paper, but it already does:
- dependence analysis
- scheduling
- loop transformation / code generation
A similar technique for direction vectors was given by Wolf & Lam (1991).

Farkas Scheduling [Feautrier 92]
Given a PRDG, find a schedule θ_S for each statement S.
θ is restricted to affine functions.
Affine form of Farkas' lemma: given a domain D = {x : Ax + b ≥ 0}, an affine form ψ(x)
is non-negative everywhere in D iff it can be written as a positive combination
ψ(x) = λ_0 + λ·(Ax + b), with λ_0 ≥ 0 and λ ≥ 0,
where λ_0 and the vector λ are the Farkas multipliers.

Problem Formulation
Given a PRDG with nodes N and edges E:
- Positivity: all schedules start at 0, i.e., θ_S(x) ≥ 0 for every S in N and x in D_S
- Causality: θ_dst(e)(y) ≥ θ_src(e)(x) + 1 for every source/destination instance pair (x,y)
  in D_e, i.e., whenever the dependence is active (note: the edge goes from producer to consumer)

Using Farkas Lemma
Given statements S1 and S2 with schedules θ_S1, θ_S2, and a dependence e from S1 to S2,
we want to make sure that θ_S2(y) > θ_S1(x) for all (x,y) in D_e,
which is θ_S2(y) - θ_S1(x) - 1 ≥ 0 in D_e.
Make it a single function to get ψ_e(x,y) ≥ 0 in D_e.

The Farkas Method
Build constraints on the schedule:
- build ψ_e(x,y) for each e
- each ψ_e constrains the Farkas multipliers
- solve!

Example 1
Consider the following (note: the dependences are written consumer to producer):

for (i=0 .. N) {
  for (j=0 .. i-1)
S0: x[i] = x[i] - L[i,j]*x[j];
S1: x[i] = x[i] / L[i,i];
}

D_S0: {[i,j] : 0≤i≤N and 0≤j<i}
D_S1: {[i] : 0≤i≤N}
e1: S0[i,j] -> S0[i,j-1]
e2: S1[i] -> S0[i,i-1]
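As a worked sketch of the constraints (my own derivation, not from the slide), assume θ_S0(i,j) = a_1 i + a_2 j + a_0 and θ_S1(i) = b_1 i + b_0:

```latex
\psi_{e_1}(i,j) = \theta_{S0}(i,j) - \theta_{S0}(i,j-1) - 1 = a_2 - 1 \ge 0
  \;\Rightarrow\; a_2 \ge 1
\\
\psi_{e_2}(i) = \theta_{S1}(i) - \theta_{S0}(i,i-1) - 1
             = (b_1 - a_1 - a_2)\,i + (b_0 - a_0 + a_2 - 1) \ge 0
  \quad \text{for } 0 \le i \le N
\\
\text{Farkas: } (b_1 - a_1 - a_2)\,i + (b_0 - a_0 + a_2 - 1)
  = \lambda_0 + \lambda_1\,i + \lambda_2\,(N - i), \quad \lambda_0,\lambda_1,\lambda_2 \ge 0
```

One admissible solution is θ_S0(i,j) = j and θ_S1(i) = i, which makes both ψ values 0.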

Example 2
Consider the following:

for (i=0 .. N) {
S0: x[i] = 0;
  for (j=0 .. N)
S1: x[i] = x[i] + L[i,j]*b[j];
}

D_S0: {[i] : 0≤i≤N}
D_S1: {[i,j] : 0≤i,j≤N}
e1: S1[i,j] -> S0[i] : j=0
e2: S1[i,j] -> S1[i,j-1] : j>0
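Again as a worked sketch (my derivation), with θ_S0(i) = b_1 i + b_0 and θ_S1(i,j) = a_1 i + a_2 j + a_0:

```latex
\psi_{e_2} = \theta_{S1}(i,j) - \theta_{S1}(i,j-1) - 1 = a_2 - 1 \ge 0
  \;\Rightarrow\; a_2 \ge 1
\\
\psi_{e_1}(i) = \theta_{S1}(i,0) - \theta_{S0}(i) - 1
             = (a_1 - b_1)\,i + (a_0 - b_0 - 1) \ge 0
  \quad \text{for } 0 \le i \le N
```

One admissible solution is θ_S0(i) = 0 and θ_S1(i,j) = j + 1.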

Example 3
Back to this example:

for (i=1; i<N; i++)
  for (j=1; j<M; j++)
S:  A[i][j] = A[i][j-1] + A[i-1][M-j];

θ_s = a_1 i + a_2 j + a_0
e1: (i,j -> i,j-1)
e2: (i,j -> i-1,M-j)

Here e1 forces a_2 ≥ 1, but e2 requires a_1 + a_2(2j - M) ≥ 1 for all j, which no fixed
affine θ_s can satisfy once M is a parameter: there is no one-dimensional affine schedule.

Multi-Dimensional Scheduling
One-dimensional affine schedules are not sufficient:
linearizing the lexicographic order requires coefficients that grow with the parameters
(e.g., M·i + j), which is polynomial rather than affine when you have parameters.
So we want to find a set of θs, one per time dimension, for each statement.

Multi-Dimensional Farkas
Formulate the problem just like the 1D case: each dependence adds constraints.
But we allow some dependences to be not (yet) strongly satisfied.
Recall the causality condition; with δ = θ_dst(e)(y) - θ_src(e)(x):
- δ < 0 : dependence violation
- δ = 0 : weakly satisfied
- δ > 0 : strongly satisfied

Greedy Algorithm
Given a PRDG with edges E:
1. formulate the problem for all edges in E
2. weakly satisfy all of them
3. strongly satisfy as many as possible
4. add the obtained θ to the list
5. remove the strongly satisfied edges from E
6. repeat until E is empty
The obtained list of θs is your (multi-dimensional) schedule; the example below shows it in action.

Back to the Example
Back to this example:

for (i=1; i<N; i++)
  for (j=1; j<M; j++)
S:  A[i][j] = A[i][j-1] + A[i-1][M-j];

θ_s = a_1 i + a_2 j + a_0
e1: (i,j -> i,j-1)
e2: (i,j -> i-1,M-j)
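Here is how the greedy algorithm plays out on this example (a sketch; the arithmetic is easy to verify by hand):

```latex
\theta^1(i,j) = i:\quad
  e_2:\ i - (i-1) = 1 > 0 \ \text{(strong, removed)};\qquad
  e_1:\ i - i = 0 \ \text{(weak, kept)}
\\
\theta^2(i,j) = j:\quad
  e_1:\ j - (j-1) = 1 > 0 \ \text{(strong, removed; $E$ is now empty)}
\\
\Rightarrow\ \theta(i,j) = (i,\ j)
```

The resulting two-dimensional schedule is simply the original loop order: the i loop carries e2 and the j loop carries e1.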

The Vertex Method
Another method for scheduling, which uses the generator representation of polyhedra:
- Constraint representation: intersection of half-spaces
- Generator representation: convex hull of vertices, rays, and lines
The Mapping of Linear Recurrence Equations on Regular Arrays, Patrice Quinton and Vincent Van Dongen, 1989

The Main Theorem
A schedule legal for the vertices + rays + lines is also legal for the entire polyhedron
generated by them:
- you can compute constraints on schedules from the generators alone
- no need to reason about a potentially infinite set of iterations
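A small illustration (mine, not from the slides): for a parametric rectangle, the theorem reduces the legality check to finitely many points.

```latex
D = \{(i,j) : 0 \le i \le N,\ 0 \le j \le M\}
  = \mathrm{conv}\{(0,0),\ (N,0),\ (0,M),\ (N,M)\}
\\
\text{An affine } \psi \text{ satisfies } \psi \ge 0 \text{ on all of } D
\iff \psi \ge 0 \text{ at the four vertices.}
```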

On the Optimality of Scheduling
Paper by Alain Darte and Frédéric Vivien: a survey of various methods for scheduling.
- What is the dependence abstraction used?
- What can you say about optimality?
Optimality: does the method find all parallelism? And how do you define "all" parallelism?

Scheduling Algorithms
- Allen and Kennedy [1987]: targeting vector machines; dependence levels
- Wolf and Lam [1991]: Lamport-like; dependence vectors
- Darte and Vivien [1996]: Farkas-like; dependence polyhedra
- Feautrier [1992]: Farkas algorithm; affine dependences
- Lim and Lam [1997]

Allen and Kennedy (in short)
You have dependence levels only, i.e., you only know the loop level at which a dependence
is carried (this paper introduced dependence levels).
Parallelizes the inner loops that have no loop-carried dependence.
Also deals with loop fusion: if a dependence is carried by some outer common loop, the loops can safely be fused.

Optimality of Allen and Kennedy
The dependence information is very limited: dependence levels only.
Given that, the parallelism found is actually optimal, as later proved by Darte and Vivien.

Wolf and Lam (in short)
Input: direction vectors
Output: fully permutable loops (what does this mean?)
Context: unimodular transformations
Optimal parallelism extraction:
- if you only know direction vectors
- for perfectly nested loops

Optimality of Farkas Algorithm
The original paper made no optimality claims; it was later proved by Darte and Vivien.
The greedy algorithm is actually optimal! With a few caveats:
- affine schedules
- one schedule per statement

Index-Set Splitting
Piece-wise affine schedules: split a statement into multiple statements, or split an equation into ...
Main idea: using one schedule for the entire statement is (sometimes) not optimal.

Example: Smashing
Periodic boundaries: can you tile?
(figure: an i-j iteration space)

Example: Smashing
Periodic boundaries
(figures: two i-j iteration spaces)

How Good is Optimal?
What does Farkas scheduling bring?