
1 Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories
Muthu Baskaran 1, Uday Bondhugula 1, Sriram Krishnamoorthy 1, J. Ramanujam 2, Atanas Rountev 1, P. Sadayappan 1
1 Department of Computer Science & Engineering, The Ohio State University
2 Department of Electrical and Computer Engineering, Louisiana State University

2 Outline
Introduction
Challenges
Automatic Data Management
Multi-level Tiling
Experiments
Related Work
Summary
Ongoing and Future Work

3 Single-processor performance
Improved by ~50%/yr for almost two decades (clock speed, ILP, ...); clock speed increased over 100x
Limits to single-processor performance growth:
- Increase in power density
- Flattening of clock speed due to power limitations
Transistor density continues to rise unabated
Multiple cores are now the best option for sustained performance growth

4 Need to optimize memory bandwidth and latency in multi-core architectures
Traditional solution: introduce a cache hierarchy
- Drawback: caches are hardware-managed, making it difficult to model miss behavior and to predict program execution times
Solution in many modern architectures: fast on-chip, explicitly managed memory, i.e., scratchpad memory (local memory store)

5 Scratchpads
Software-managed:
- Control over data movement
- Easier to model performance
- Burden on the programmer/compiler to manage and utilize
Lower power per chip area than a cache
Modern architectures with scratchpad memories: GPU, Cell, MPSoC

6 Outline: Challenges

7 Effective management of on-chip scratchpads in multi-core architectures
- Utilize the limited capacity of the scratchpad
- Optimize data movement
Effective computation mapping in many-core architectures with multiple levels of parallelism
- Exploit the available parallelism
- Account for scratchpad capacity constraints

8 Outline: Automatic Data Management

9 Orchestration of data movement between off-chip (global) and on-chip scratchpad memory
Decisions on:
- What data elements to move in and out of the scratchpad
- When to move data
- How to move data
- How to access the data elements copied to the scratchpad

10 Automatic data management involves three steps:
1. Allocation of storage space (as arrays) in the scratchpad memory for local copies
2. Determination of the access functions of arrays in scratchpad memories
3. Generation of code for moving data between scratchpad (local) and off-chip (global) memories

11 Targeted at affine programs
- Dense arrays
- Loop bounds: affine functions of outer loop variables, constants, and program parameters
- Array access functions: affine functions of surrounding loop variables, constants, and program parameters
Developed using the polyhedral model, an algebraic framework for representing affine programs (statement domains, dependences, array access functions) and affine program transformations

12 Polyhedral Model
for (i=1; i<=4; i++)
  for (j=2; j<=4; j++)
S1:   a[i][j] = a[j][i] + a[i][j-1];

The iteration vector of S1 is $\vec{x}_{S1} = (i, j)^T$; its iteration domain is the polytope
$$ \mathcal{I}_{S1} = \{ (i, j) \mid 1 \le i \le 4,\; 2 \le j \le 4 \}. $$
Each array reference is an affine access function of the iteration vector:
$$ F_{1a}(\vec{x}_{S1}) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \vec{x}_{S1} + \begin{pmatrix} 0 \\ 0 \end{pmatrix} \text{ for a[i][j]}, \qquad F_{2a}(\vec{x}_{S1}) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \vec{x}_{S1} + \begin{pmatrix} 0 \\ 0 \end{pmatrix} \text{ for a[j][i]}, $$
$$ F_{3a}(\vec{x}_{S1}) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \vec{x}_{S1} + \begin{pmatrix} 0 \\ -1 \end{pmatrix} \text{ for a[i][j-1]}. $$
The data space accessed by a reference is the image of the iteration domain under its access function, e.g. $\mathcal{D}^{S1}_{1a} = F_{1a}(\mathcal{I}_{S1})$.

13 Given a program block, identify the storage space needed for each non-overlapping accessed region of all arrays
- Access functions of array references may be non-uniformly generated
For architectures supporting direct data access from off-chip memory (e.g., the NVIDIA GeForce GPU), estimate the extent of reuse of the data to determine whether or not to copy it to the scratchpad
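One simple form such a reuse estimate can take (an illustrative heuristic; the paper's precise criterion may differ) compares how often the block touches an array region with the region's size:
$$ \mathrm{reuse}(A) \;=\; \frac{\#\{\text{accesses to } A \text{ in the block}\}}{|\,F_A(\mathcal{I})\,|}, $$
copying the region into the scratchpad only when $\mathrm{reuse}(A) > 1$; data touched only once can instead be streamed directly from off-chip memory.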

14 Algorithm and Illustration
1. Find the set of all data spaces accessed by all references to an array, using the access function of each reference and the iteration space of the statement that holds the reference
2. Partition the set of data spaces into maximal disjoint (non-overlapping) subsets
3. Find the bounding box of each partition; allocate one local memory array per bounding box (see the sketch below)

for (i=10; i<=14; i++) {
  for (j=10; j<=14; j++) {
    A[i][j+1] = A[i+j][j+1] * 3;
    for (k=11; k<=20; k++)
      B[i][j+k] = A[i][k] + B[i+j][k];
  }
}

For array A, two disjoint regions are accessed (rows 10..14, columns 11..20 via A[i][j+1] and A[i][k]; rows 20..28, columns 11..15 via A[i+j][j+1]), giving two local arrays:
Local array LA0: lb(i) = 10, ub(i) = 14; lb(j) = 11, ub(j) = 20
Local array LA1: lb(i) = 20, ub(i) = 28; lb(j) = 11, ub(j) = 15
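A minimal sketch of the bounding-box step (illustrative code, not the paper's implementation; ref_box and overlap are hypothetical helpers): because the access functions are affine and the iteration domains here are rectangles, each dimension of a reference's data space attains its extrema at a corner of the domain.

#include <stdio.h>
#include <limits.h>

/* Bounding box of a 2D region: lb[d] <= x[d] <= ub[d]. */
typedef struct { int lb[2], ub[2]; } Box;

/* Data-space bounding box of the affine reference
 *   F(i,j) = (c[0] + a[0]*i + b[0]*j,  c[1] + a[1]*i + b[1]*j)
 * over the rectangular domain ilo <= i <= ihi, jlo <= j <= jhi.
 * An affine function attains its extrema at the domain's corners. */
static Box ref_box(const int a[2], const int b[2], const int c[2],
                   int ilo, int ihi, int jlo, int jhi) {
    int corners[4][2] = {{ilo,jlo}, {ilo,jhi}, {ihi,jlo}, {ihi,jhi}};
    Box bx = {{INT_MAX, INT_MAX}, {INT_MIN, INT_MIN}};
    for (int k = 0; k < 4; k++)
        for (int d = 0; d < 2; d++) {
            int v = c[d] + a[d]*corners[k][0] + b[d]*corners[k][1];
            if (v < bx.lb[d]) bx.lb[d] = v;
            if (v > bx.ub[d]) bx.ub[d] = v;
        }
    return bx;
}

/* 1 if the boxes intersect in every dimension. */
static int overlap(Box x, Box y) {
    for (int d = 0; d < 2; d++)
        if (x.ub[d] < y.lb[d] || y.ub[d] < x.lb[d]) return 0;
    return 1;
}

int main(void) {
    /* The three references to A in the example:
       A[i][j+1]   over i = 10..14, j = 10..14
       A[i][k]     over i = 10..14, k = 11..20
       A[i+j][j+1] over i = 10..14, j = 10..14  */
    Box r1 = ref_box((int[]){1,0}, (int[]){0,1}, (int[]){0,1}, 10, 14, 10, 14);
    Box r2 = ref_box((int[]){1,0}, (int[]){0,1}, (int[]){0,0}, 10, 14, 11, 20);
    Box r3 = ref_box((int[]){1,0}, (int[]){1,1}, (int[]){0,1}, 10, 14, 10, 14);

    /* r1 and r2 overlap, so they share one local array (LA0);
       r3 is disjoint from both and gets its own (LA1). */
    Box la0 = r1;
    if (overlap(r1, r2))
        for (int d = 0; d < 2; d++) {
            if (r2.lb[d] < la0.lb[d]) la0.lb[d] = r2.lb[d];
            if (r2.ub[d] > la0.ub[d]) la0.ub[d] = r2.ub[d];
        }
    printf("LA0: A[%d..%d][%d..%d]\n", la0.lb[0], la0.ub[0], la0.lb[1], la0.ub[1]);
    printf("LA1: A[%d..%d][%d..%d]\n", r3.lb[0], r3.ub[0], r3.lb[1], r3.ub[1]);
    return 0;
}

Running this reproduces the bounds on the slide: LA0 spans A[10..14][11..20] and LA1 spans A[20..28][11..15].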

15 Array dimension in the scratchpad may be lower than the original array dimension, depending on the accessed data
Access function in the local memory array: the original access function, or a reduced access function, offset by the lower bounds (in each dimension) of the scratchpad array

Original code:
for (i=10; i<=14; i++) {
  for (j=10; j<=14; j++) {
    A[i][j+1] = A[i+j][j+1] * 3;
    for (k=11; k<=20; k++)
      B[i][j+k] = A[i][k] + B[i+j][k];
  }
}

Code with scratchpad accesses:
for (i=10; i<=14; i++) {
  for (j=10; j<=14; j++) {
    LA0[i-10][j+1-11] = LA1[i+j-20][j+1-11] * 3;
    for (k=11; k<=20; k++)
      LB0[i-10][j+k-21] = LA0[i-10][k-11] + LB1[i+j-20][k-11];
  }
}
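In symbols (with $\vec{lb}_k$ denoting the per-dimension lower bounds of local array $k$'s bounding box, as computed on the previous slide), the local access function is the global one shifted to the box origin:
$$ F^{local}_{k}(\vec{x}) \;=\; F_{k}(\vec{x}) \;-\; \vec{lb}_{k}. $$
For example, A[i][j+1] becomes LA0[i-10][(j+1)-11], since LA0's bounding box starts at (10, 11).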

16 Generation of loop structure
Scan the polytopes (using CLooG, a tool for code generation) corresponding to the data spaces of:
- read references, for moving data into the scratchpad
- write references, for moving data out of the scratchpad
Generation of loop body (data movement statement): copy between a location in the scratchpad buffer and an off-chip memory location

/* Data move-in code */
for (i=10; i<=14; i++) {
  for (j=11; j<=20; j++)
    LA0[i-10][j-11] = A[i][j];
}
for (i=20; i<=28; i++) {
  for (j=max(i-13,11); j<=min(15,i-9); j++)
    LA1[i-20][j-11] = A[i][j];
}

/* Data move-out code */
for (i=10; i<=14; i++) {
  for (j=11; j<=15; j++)
    A[i][j] = LA0[i-10][j-11];
}
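In the notation of slide 12, the polytope scanned for a read reference r of statement S is the image of the statement's iteration domain under the reference's access function:
$$ \mathcal{D}_r \;=\; F_r(\mathcal{I}_S) \;=\; \{\, F_r(\vec{x}) \mid \vec{x} \in \mathcal{I}_S \,\}. $$
The non-rectangular move-in bounds for LA1 above, $\max(i-13, 11) \le j \le \min(15, i-9)$, are exactly what scanning this image polytope yields for the reference A[i+j][j+1].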

17 Outline: Multi-level Tiling

18 GPU architecture
Architectural components:
- Slow off-chip (global) memory
- Two levels of parallelism: a set of multiprocessors, and a set of processor cores in each multiprocessor
- A scratchpad on each multiprocessor, shared by its processor cores

19 Builds on a tiling transformation framework recently developed at OSU by Bondhugula (CC-08, PLDI-08)
- Finds tiling transformations (hyperplanes) for sequences of imperfectly nested loops, enabling communication-minimal parallelization and locality optimization
- Identifies loops to tile for parallelism and data locality
Multiple levels of tiling exploit parallelism across multiple parallel levels
Additional (sequential) tiling at each level that has a scratchpad, if the data required by a tile executing at that level exceeds the scratchpad capacity (illustrated on the next slide)
- Data movement at the start and end of each sequential tile
- Synchronization points to ensure consistency

20 Original loop nest:
FORALL i = 1, Ni
  FORALL j = 1, Nj
    FOR k = 1, WS
      FOR l = 1, WS
        S1
      END FOR
    END FOR
  END FORALL
END FORALL

After multi-level tiling:
// Tiling to distribute at the outer level
FORALL iT = 1, Ni, Ti
  FORALL jT = 1, Nj, Tj
    // Tiling to satisfy the scratchpad memory limit
    FOR i' = iT, min(iT+Ti-1,Ni), ti'
      FOR j' = jT, min(jT+Tj-1,Nj), tj'
        FOR k' = 1, WS, tk'
          FOR l' = 1, WS, tl'
            // Tiling to distribute at the inner level
            FORALL it = i', min(i'+ti'-1,Ni), ti
              FORALL jt = j', min(j'+tj'-1,Nj), tj
                FOR i = it, min(it+ti-1,Ni)
                  FOR j = jt, min(jt+tj-1,Nj)
                    FOR k = k', min(k'+tk'-1,WS)
                      FOR l = l', min(l'+tl'-1,WS)
                        S1
                      END FOR
                    END FOR
                  END FOR
                END FOR
              END FORALL
            END FORALL
          END FOR
        END FOR
      END FOR
    END FOR
  END FORALL
END FORALL

21 Handling scratchpad memory constraints
Cost model for data movement: C = N x (S + (V x L)/P)
- N: number of data movements
- S: synchronization cost per data movement
- V: number of elements per data movement (based on tile sizes)
- L: cost to transfer one element
- P: number of processes involved in the data movement
Tile size search formulation (see the sketch after the next slide):
- Constraint: memory requirement within the scratchpad limit
- Objective: minimize the data movement cost C
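As a quick illustration with hypothetical numbers: moving N = 64 tiles of V = 4096 elements each, with S = 100 cycles, L = 1 cycle per element, and P = 8 cooperating processes, gives
$$ C = 64 \times \left( 100 + \frac{4096 \times 1}{8} \right) = 64 \times 612 = 39168 \text{ cycles}, $$
whereas quadrupling the tile volume (N = 16, V = 16384) gives $16 \times (100 + 2048) = 34368$ cycles: fewer, larger transfers amortize the per-movement synchronization cost, pushing tile sizes toward the scratchpad capacity limit.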

22 Loop nest of m loops with tile sizes t_1, t_2, ..., t_m; nl local arrays
- M_j: memory required (as a function of the tile sizes) for local array j
- V_in_j and V_out_j: volume (as a function of the tile sizes) moved into and out of local array j, respectively
- r_j: position in the loop nest at which the data movement code of array j is placed, which determines N_j, the number of data movements for array j, via the trip counts of the r_j surrounding loops
- M_up: total scratchpad memory
Variables: t_1, t_2, ..., t_m

Memory constraint:
$$ \sum_{j=1}^{nl} M_j(t_1, \ldots, t_m) \;\le\; M_{up} $$

Objective function (instantiating the cost model of the previous slide for each local array):
$$ \min_{t_1, \ldots, t_m} \; \sum_{j=1}^{nl} N_j \left( S + \frac{\left(V_{in_j} + V_{out_j}\right) L}{P} \right) $$
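A minimal sketch of one way to solve such a formulation, by exhaustive search over power-of-two tile sizes for a single 2D local array (the constants and the simple memory/volume functions here are illustrative assumptions, not the paper's formulation):

#include <stdio.h>
#include <math.h>

int main(void) {
    /* Hypothetical constants for illustration: */
    const double S   = 100.0;    /* sync cost per data movement (S)       */
    const double L   = 1.0;      /* cost to transfer one element (L)      */
    const double P   = 8.0;      /* processes per data movement (P)       */
    const double Mup = 16384.0;  /* scratchpad capacity M_up, in elements */
    const int    n1  = 2048, n2 = 2048;  /* loop trip counts              */

    double best_cost = INFINITY;
    int best_t1 = 0, best_t2 = 0;

    /* Exhaustive search over power-of-two tile sizes (t1, t2). */
    for (int t1 = 1; t1 <= n1; t1 *= 2)
        for (int t2 = 1; t2 <= n2; t2 *= 2) {
            double M = (double)t1 * t2;       /* M_j: local-array footprint  */
            if (M > Mup) continue;            /* memory constraint           */
            double N = ceil((double)n1 / t1)  /* N: data movements = product */
                     * ceil((double)n2 / t2); /*    of tile-loop trip counts */
            double V = (double)t1 * t2;       /* V: elements per movement    */
            double C = N * (S + V * L / P);   /* cost model from slide 21    */
            if (C < best_cost) { best_cost = C; best_t1 = t1; best_t2 = t2; }
        }
    printf("best tiles: %d x %d, cost = %.0f\n", best_t1, best_t2, best_cost);
    return 0;
}

Because all candidates are enumerated, ties in cost resolve in favor of the first candidate found; a real implementation would instead derive M, V, and N symbolically from the bounding boxes and loop structure.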

23 Outline: Experiments

24 Motion Estimation Kernel (1/2)
[Performance chart]
Machine information: NVIDIA GeForce 8800 GTX; 16 x 8 cores @ 1.35 GHz; 768 MB off-chip memory; 16 x 16 KB scratchpad

25 1D Jacobi Kernel (1/2)
[Performance chart; same machine as slide 24]

26 [Performance chart, with the tile size chosen by the model marked; same machine as slide 24]

27 [Performance chart, with the tile size chosen by the model marked; same machine as slide 24]

28 Outline: Related Work

29 Scratchpad memory management
- Data reuse: Issenin et al. [DAC06]
- Allocation for uniformly generated references: Schreiber and Cronquist [HPLTR04]; Anantharaman and Pande [RTSS98]; Kandemir et al. [CAD04]
Improving performance on cached architectures
- Ferrante et al. [LCPC92]; Gallivan et al. [ICS88]
Multi-level tiling
- Fatahalian et al. [SC06]: various levels of memory
- Bikshandi et al. [PPoPP06] and Renganarayanan et al. [SC07, IPDPS07]: parallelism and locality

30 Outline: Summary

31 Addressed two issues in compiling for modern multi-level parallel architectures with scratchpads:
1. Data management in scratchpad memory
   a. Data allocation
   b. Access in the scratchpad
   c. Code generation for data movement
2. Mapping of computation in regular programs onto multiple levels of parallel units
Experimental evaluation using an NVIDIA GPU

32 Outline: Ongoing and Future Work

33 Developing an end-to-end compiler framework for modern many-core architectures such as GPUs
- The algorithms developed in this work are an integral part of the overall compiler framework
Further optimizing transformations such as tiling for modern architectures such as GPUs, using model-driven empirical search


