CR18: Advanced Compilers L01 Introduction Tomofumi Yuki.


1 CR18: Advanced Compilers L01 Introduction Tomofumi Yuki

2 Myself
- Tomofumi Yuki, researcher at Inria
- Ph.D. from Colorado State University in 2012
- up to high school in Japan; CSU for bachelor's, master's, and Ph.D.
- member of Compsys @ LIP: compilers/languages, automatic parallelization

3 This Course
- Part I: high-level (loop-level) transformations: parallelism, data locality
- Part II: High-Level Synthesis: C to hardware

4 Compiler Optimizations
- Low-level optimizations: register allocation, instruction scheduling, constant propagation, ...
- High-level optimizations (our focus): loop transformations, coarse-grained parallelism, ...

5 High-Level Optimizations
Goals: parallelism and data locality
- Why parallelism?
- Why data locality?
- Why high-level?

6 Why Loop Transformations?
The 90/10 rule: “90% of the execution time is spent in less than 10% of the source code”
Loop nests:
- hotspot of almost all programs
- a few lines of change => huge impact
- natural source of parallelism

7 Why Loop Transformations?
Which is faster?

    for (i=0; i<N; i++)
      for (j=0; j<N; j++)
        for (k=0; k<N; k++)
          C[i][j] += A[i][k] * B[k][j];

    for (i=0; i<N; i++)
      for (k=0; k<N; k++)
        for (j=0; j<N; j++)
          C[i][j] += A[i][k] * B[k][j];
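The two orders compute the same result, so any speed difference comes purely from the order in which memory is touched. A minimal sketch of the two variants (the size N and the comparison harness are illustrative, not from the slides):

```c
#include <assert.h>
#include <string.h>

#define N 64

/* i-j-k order: the inner loop walks B[k][j] down a column (large stride). */
void matmul_ijk(double C[N][N], double A[N][N], double B[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* i-k-j order: C and B are both walked along rows (stride 1) in the
   inner loop, which caches and prefetchers handle far better. */
void matmul_ikj(double C[N][N], double A[N][N], double B[N][N]) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];
}
```

Both versions accumulate into C[i][j] over k in the same ascending order, so the results are bitwise identical; only the traversal of memory changes.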

8 Why is it Faster?
Hardware prefetching. In the i-j-k order, the inner loop leaves C[i][j] unchanged, moves A[i][k] to the next column, and moves B[k][j] to the next row (a large stride). In the i-k-j order, A[i][k] is unchanged while C[i][j] and B[k][j] both move to the next column, so every access is stride-1.

    for (i=0; i<N; i++)
      for (j=0; j<N; j++)
        for (k=0; k<N; k++)
          C[i][j] += A[i][k] * B[k][j];

    for (i=0; i<N; i++)
      for (k=0; k<N; k++)
        for (j=0; j<N; j++)
          C[i][j] += A[i][k] * B[k][j];

9 How to Automate?
The most challenging part! The same optimization doesn’t work for the following. Why?

    for (i=0; i<N; i++)
      for (j=0; j<N; j++)
        for (k=0; k<N; k++) {
          C1[i][j] += A1[i][k] * B1[k][j];
          C2[i][j] += A2[i][k] * B2[k][j];
          C3[i][j] += A3[i][k] * B3[k][j];
          C4[i][j] += A4[i][k] * B4[k][j];
        }

10 It’s Not Just Transformations
Many, many reasoning steps: what to apply? how to apply? when to apply? what is its impact?
Quality of the analysis:
- How long does it take?
- Can it potentially degrade performance?
- Provable properties (completeness, etc.)
Compiler research is all about coming up with techniques/abstractions/representations that allow the compiler to perform deep analysis.

11 Today’s Agenda
- The Big Picture: programming languages, compilers
- Basic Concepts: iteration spaces and loop nests, polyhedral domains and functions, parametric integer programming
- A short history of the polyhedral model

12 Compiler Advances
Old compiler vs. recent compiler: modern architecture, different versions of gcc.
How much speedup comes from the compiler alone, after 20 years of research?

13 Compiler Advances
Old compiler vs. recent compiler: modern architecture, different versions of gcc.
About a 2x difference after 20 years (anecdotal). Not so much?

14 Compiler Advances
About a 2x difference after 20 years (anecdotal). Not so much?
“The most remarkable accomplishment by far of the compiler field is the widespread use of high-level languages.” by Mary Hall, David Padua, and Keshav Pingali [Compiler Research: The Next 50 Years, 2009]

15 Placement of Compiler Research
Part of programming languages: compilers, runtime systems, program verification, type theory, program synthesis, program analysis, program transformation.

16 Earlier Accomplishments
- Getting efficient assembly: register allocation, instruction scheduling, ...
- High-level language features: object orientation, dynamic types, automated memory management, ...

17 New Twists
- New machines: SIMD, IBM Cell, GPGPU, Xeon Phi
- New language features: even Java has lambda functions now; parallelism-oriented features
- New types of apps: smartphones, tablets
- New goals: energy and security

18 Recent Research Topics
- Parallelism: multi-cores, GPUs, ...; language features for parallelism
- Security/Reliability: verification, certified compilers
- Power/Energy: data movement, voltage scaling

19 Goals of the Compiler
- Higher abstraction: no more writing assembly! enables language features: loops, functions, classes, aspects, ...
- Performance while increasing productivity: speed, space, energy, ...; compiler optimizations
Personal view: the compiler is there to allow lazy programming.

20 Job Market
Where do they work? IBM, MathWorks, Amazon, Apple, start-ups.
Many opportunities in France: MathWorks @ Grenoble, many start-ups.

21 Today’s Agenda
- The Big Picture: programming languages, compilers
- Basic Concepts: iteration spaces and loop nests, polyhedral domains and functions, parametric integer programming
- A short history of the polyhedral model

22 Program IR
The Abstract Syntax Tree is the basic representation within compilers. How would we inspect the AST to determine whether a loop is parallel?

    for (i in 1..N)
      A[i] = B[i] + 1;

AST: NodeFor (iterator=i, LB=1, UB=N) containing a NodeAssignment whose target is A[i] and whose right-hand side is NodeBinOp (op=+) over B[i] and 1.
Not really suitable for high-level analysis.

23 Extended Graphs
Completely unroll the loops:

    for (i=0; i<5; i++)
      for (j=1; j<4; j++) {
        A[i][j] = A[i][j-1] + B[i][j];
      }

becomes

    A[0][1] = A[0][0] + B[0][1];
    A[0][2] = A[0][1] + B[0][2];
    A[0][3] = A[0][2] + B[0][3];
    A[1][1] = A[1][0] + B[1][1];
    A[1][2] = A[1][1] + B[1][2];
    A[1][3] = A[1][2] + B[1][3];
    ...

24 Extended Graphs
Completely unroll the loops. The difficulty: program parameters.
- it’s “easy” with a DAG representation
- scalability issues
- what if the parameters are not known?

    for (i=0; i<N; i++)
      for (j=1; j<M; j++) {
        A[i][j] = A[i][j-1] + B[i][j];
      }

25 Iteration Spaces
We need an abstraction for statement instances:
- instance = integer vector [i,j]
- space = integer set 0≤i<N and 1≤j<M

    for (i=0; i<N; i++)
      for (j=1; j<M; j++) {
        A[i][j] = A[i][j-1] + B[i][j];
      }
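The abstraction can be made concrete: each statement instance is the integer vector [i,j], and the iteration space is the set of all such vectors with 0≤i<N and 1≤j<M. A small sketch (the concrete N and M are illustrative):

```c
#include <assert.h>

#define N 3
#define M 4

/* Enumerate the instances of the loop body in execution order, storing
   each instance as an integer vector [i,j]. Returns the instance count. */
int enumerate_instances(int instances[][2], int cap) {
    int n = 0;
    for (int i = 0; i < N; i++)
        for (int j = 1; j < M; j++) {
            if (n < cap) {
                instances[n][0] = i;
                instances[n][1] = j;
            }
            n++;
        }
    return n;
}
```

The space has exactly N*(M-1) elements, one integer vector per execution of the statement.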

26 Lexicographic Order
Dictionary order applied to loop nests: a, aaa, aaaa, aab, aba, b.
Compare instances: (i,j) is before (i’,j’) iff i<i’, or i=i’ and j<j’.

    for (i=1; i<N; i++)
      for (j=1; j<M; j++)
        S0;
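The comparison rule for 2D instances can be written out directly (a minimal sketch; the function name is illustrative):

```c
#include <assert.h>

/* (i,j) executes before (i2,j2) in the nest above iff
   i < i2, or i == i2 and j < j2. */
int lex_before(int i, int j, int i2, int j2) {
    return i < i2 || (i == i2 && j < j2);
}
```

The outer iterator decides first; only on a tie does the inner iterator matter, exactly like comparing words letter by letter.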

27 What is the Polyhedral Model?
It depends (on who you ask). If you ask me: a compiler intermediate representation (IR)
- linear algebra based
- compact representation
- takes advantage of regularities

28 Polyhedral Representation
High-level abstraction of the program:
- iteration space: integer polyhedron
- dependences: affine functions
Usual optimization flow:
1. extract the polyhedral representation
2. reason about / transform the model
3. generate code at the end

29 Polyhedral Domains
Statement instances as integer polyhedra.
Example: the roughly N²/2 instances of S0 form its domain, represented as the polyhedron {i,j | 1≤i<N, 1≤j≤i}.
Geometric view: a triangle bounded by 1≤i, 1≤j, i<N, and j≤i.

    for (i=1; i<N; i++)
      for (j=1; j<=i; j++)
        S0;
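The domain is just a set of integer points, so a membership predicate plus a scan of a bounding box recovers the count, exactly N(N-1)/2 (roughly N²/2). A sketch (helper names are illustrative):

```c
#include <assert.h>

/* Membership test for the polyhedron {i,j | 1 <= i < n, 1 <= j <= i}. */
int in_domain(int i, int j, int n) {
    return 1 <= i && i < n && 1 <= j && j <= i;
}

/* Count the integer points of the domain by scanning a bounding box. */
int domain_size(int n) {
    int count = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (in_domain(i, j, n)) count++;
    return count;
}
```

In practice libraries count such sets symbolically; the brute-force scan is only for intuition.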

30 Examples (Domains)
What are the domains of these statements?

    for (i=0; i<=N; i++) {
      for (j=0; j<=M; j++) {
        S1;
      }
      S2;
    }

    for (i=0; i<=N; i++)
      for (j=M; j>=0; j--)
        S1;

    for (i=0; i<=N; i++)
      for (j=0; j<=M; j+=2)
        S1;

    for (i=0; i<=N; i++)
      for (j=0; j<=M; j++)
        if (j>i) S1;

31 Z-Polyhedron
A polyhedron with holes:
- intersection with lattices
- image of a domain by an affine function
Just a polyhedron in a higher-dimensional space: 0≤i≤N and i%2=0 becomes 0≤i≤N and i=2j.
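The equivalence between the two formulations, 0≤i≤N ∧ i%2=0 versus the projection of {(i,j) | 0≤i≤N, i=2j}, can be checked point by point (a sketch; the helper names are illustrative):

```c
#include <assert.h>

/* The Z-polyhedron: an interval intersected with the even lattice. */
int in_zpolyhedron(int i, int n) {
    return 0 <= i && i <= n && i % 2 == 0;
}

/* The same set as the projection of an ordinary polyhedron one
   dimension up: { (i,j) | 0 <= i <= n, i == 2j, j >= 0 }. */
int in_projection(int i, int n) {
    for (int j = 0; 2 * j <= n; j++)
        if (i == 2 * j) return 1;
    return 0;
}
```

This is the sense in which a Z-polyhedron is “just a polyhedron in a higher-dimensional space”: the extra dimension j carries the lattice.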

32 Dependence Functions
Affine functions over statement instances:
- dataflow: (i,j → i,j+1)
- dependence: (i,j → i,j-1)

    for (i=1; i<N; i++)
      for (j=1; j<M; j++)
    S0: A[i][j] = A[i][j-1];
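The dependence can be observed by instrumenting the nest: record which iteration last wrote each array cell, and check that the value read as A[i][j-1] in iteration (i,j) was produced by iteration (i,j-1). A sketch (the encoding of an iteration as i*M+j is an assumption for illustration):

```c
#include <assert.h>

#define N 4
#define M 5

/* Run the nest, recording in reader[i][j] which iteration produced the
   value read as A[i][j-1] (encoded as i*M + j, or -1 for an initial value). */
void trace_dataflow(int reader[N][M]) {
    int last_writer[N][M];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++) {
            last_writer[i][j] = -1;
            reader[i][j] = -2;  /* iteration never executed */
        }
    for (int i = 1; i < N; i++)
        for (int j = 1; j < M; j++) {
            reader[i][j] = last_writer[i][j - 1]; /* the read of A[i][j-1] */
            last_writer[i][j] = i * M + j;        /* the write of A[i][j]  */
        }
}
```

Every read in iteration (i,j) with j≥2 sees the value written by (i,j-1), which is exactly the dependence function (i,j → i,j-1); inverting it gives the dataflow (i,j → i,j+1).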

33 Dependence Functions
Dependences can be domain qualified.
Dataflow: if j=M-1 then (i,j → i+1,1), else (i,j → i,j+1).

    for (i=1; i<N; i++)
      for (j=1; j<M; j++)
    S0: v++;

34 Loop Transformations
Also affine functions:
- loop permutation: (i,j → j,i)
- loop skewing: (i,j → i,j+i)
Affine loops + affine transformations permit linear programming, one of the few successes in parallelization.

    for (i=0; i<N; i++)
      for (j=0; j<M; j++)
        S;

becomes, after skewing,

    for (i=0; i<N; i++)
      for (j=i; j<M+i; j++)
        S';

35 Composing Transformations
The key strength of the framework: transformations compose in the abstraction. Instead of rewriting loops step by step in the loop world (T1 permutes "for i { for j ... }" into "for j { for i ... }", then T2 splits the inner loop into "for i' ..." and "for i'' ..."), the compiler composes T1 and T2 on the polyhedral representation (poly → poly') and generates loops only at the end.

36 Parametric Analysis
Real-world code is filled with parameters: code for an NxM matrix, not a 100x200 one.
If the code is not parametric, and compilation time is not a big deal, it is an “easy” problem.
We are dealing with (potentially) infinitely many different executions of a program.

37 What is the Last Iteration?
A key analysis: which instance last wrote to A[k]?
Can be formulated as an ILP: subject to 0<i<N, 0<j≤i, i+j=k, find the lexicographically maximum (i,j).
Many analysis questions become ILPs for regular programs.

    for (i=1; i<N; i++)
      for (j=1; j<=i; j++)
    S0: A[i+j] = ...;
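What the ILP answers symbolically can be checked by brute force for fixed N and k: scan the domain in execution (lexicographic) order and keep the last (i,j) with i+j = k. A sketch (the function name and interface are illustrative):

```c
#include <assert.h>

/* Returns 1 and stores in (*li, *lj) the lexicographically greatest (i,j)
   with 0 < i < n, 0 < j <= i, and i + j == k; returns 0 if no iteration
   writes A[k]. Scanning in execution order means the last match found
   is the lexicographic maximum. */
int last_writer(int n, int k, int *li, int *lj) {
    int found = 0;
    for (int i = 1; i < n; i++)
        for (int j = 1; j <= i; j++)
            if (i + j == k) { *li = i; *lj = j; found = 1; }
    return found;
}
```

The brute-force scan needs concrete values of N and k; the point of parametric ILP is to get the same answer as a closed form valid for all N and k.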

38 Parametric Integer Programming
Constraints: j≤10, i+j≤10, j-i≤N, i,j≥0, N>0
Objective: maximize j (PIP optimizes lexicographically: smallest i first, then largest j)
Parametric solution: (0,10) if N≥10; (0,N) if N<10

39 Parametric Integer Programming
Same constraints and objective. How PIP proceeds:
1. Look at the sign of the parametric constraint terms (here, N-j+i)
2. Create a branch for each case
Parametric solution:
- (0,10) if N≥10, where N-j+i≥0 is slack at the optimum
- (0,N) if N<10, where the constraint j-i≤N is tight

40 Today’s Agenda
- The Big Picture: programming languages, compilers
- Basic Concepts: iteration spaces and loop nests, polyhedral domains and functions, parametric integer programming
- A short history of the polyhedral model

41 History of the Polyhedral Model
This is also the layout for Part I of the class. Keep in mind that history is not objective.

42 Origins of the Polyhedral Model
Two starting points: loop program analysis, and systems of recurrence equations.
- Loop view: is this loop parallel? what are the dependences?
- Equational view: is this system of equations executable? how do we find legal schedules?

43 Polyhedral Timeline
From 1970 to the 2000s: recurrence equations and systolic arrays on one track, loop dependence analysis and loop transformation on the other. Parametric Integer Programming (1988) and Array Dataflow Analysis (1991) connect the two, followed by work on scheduling, code generation, and memory allocation, and more recently by multi-core, GPGPU, and distributed-memory targets.

44 Polyhedral Model: Short Story
From a (very) subjective point of view (originally by Steven Derrien): Polylib and PIP (early 90s) supported massively parallel processor arrays (VLSI), multi-dimensional process networks for system-level design (MPSoC), loop transformations for HLS (FPGA), and memory optimization for embedded multimedia. Then came the multi-core era: Cloog (2003) and Pluto (2008), with automatic parallelization for shared- and distributed-memory machines (multi-core, GPU).

45 Polyhedral Equational Model
Idea: map computations to code/hardware, with computations specified as equations.
Example: matrix multiply, C[i,j] = Σ_k (A[i,k] * B[k,j]):

    C[i,j,k] = A[i,k] * B[k,j]               if k=0
             = A[i,k] * B[k,j] + C[i,j,k-1]  if k>0

    for i in 0..P
      for j in 0..Q
        for k in 0..R
          C[i][j] += A[i][k] * B[k][j];
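The equations can be executed directly: demand-driven evaluation of C[i,j,k] gives the same answers as the accumulating loop. A sketch using integer data so the comparison is exact (the concrete bounds P, Q, R are illustrative):

```c
#include <assert.h>

#define P 3
#define Q 4
#define R 5

static long A[P + 1][R + 1], B[R + 1][Q + 1];

/* The equation: C[i,j,k] = A[i,k]*B[k,j]               if k = 0
                          = A[i,k]*B[k,j] + C[i,j,k-1]  if k > 0  */
long c_eq(int i, int j, int k) {
    long t = A[i][k] * B[k][j];
    return k == 0 ? t : t + c_eq(i, j, k - 1);
}

/* The loop version from the slide (inclusive bounds 0..P, 0..Q, 0..R). */
long c_loop(int i, int j) {
    long c = 0;
    for (int k = 0; k <= R; k++)
        c += A[i][k] * B[k][j];
    return c;
}
```

c_eq(i, j, R) equals the final loop result C[i][j]: the recurrence makes the accumulation over k explicit as a chain of value definitions.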

46 The Connection
Array Dataflow Analysis [Feautrier 1991] converts loops to equations; limited to affine loops.
- domain: {[i,j,k] : 0≤i≤P ∧ 0≤j≤Q ∧ 0≤k≤R}
- dependences: S0 → S0
- dataflow: (i,j,k → i,j,k+1)

    for i in 0..P
      for j in 0..Q
        for k in 0..R
    S0:   C[i][j] += A[i][k] * B[k][j];

47 Next Time
- Dependence Analysis
- Array Dataflow Analysis
- Legality of transformations

