Published by Zoe McLain. Modified over 3 years ago.

1
Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories
Muthu Baskaran (1), Uday Bondhugula (1), Sriram Krishnamoorthy (1), J. Ramanujam (2), Atanas Rountev (1), P. Sadayappan (1)
(1) Department of Computer Science & Engineering, The Ohio State University
(2) Department of Electrical and Computer Engineering, Louisiana State University

2
Outline
- Introduction
- Challenges
- Automatic Data Management
- Multi-level Tiling
- Experiments
- Related Work
- Summary
- Ongoing and Future Work
Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories, PPoPP 2008

3
Single-processor performance
- Improved by ~50%/yr for almost two decades
  - Clock speed, ILP, …
  - Clock speed increased over 100x
Limits to single-processor performance growth
- Increase in power density
- Flattening of clock speed due to power limitation
Transistor density continues to rise unabated
Multiple cores are now the best option for sustained performance growth

4
Need to optimize memory bandwidth and latency in multi-core architectures
Traditional solution: introducing a cache hierarchy
- Drawback: caches are hardware-managed, so it is difficult to model miss behavior and to predict program execution times
Solution in many modern architectures: fast on-chip explicitly managed memory, i.e. scratchpad memory (local memory store)

5
Scratchpads
- Software-managed
  - Control over data movement
  - Easier to model performance
  - Burden on programmer/compiler to manage and utilize
- Lower power per chip area required compared to caches
Some modern architectures with scratchpad memories
- GPU
- Cell
- MPSoC

6
Next section: Challenges

7
Effective management of on-chip scratchpads in multi-core architectures
- Utilize limited capacity of scratchpad
- Optimize data movement
Effective computation mapping in many-core architectures with multiple levels of parallelism
- Exploit available parallelism
- Account for scratchpad capacity constraints

8
Next section: Automatic Data Management

9
Orchestration of data movement between off-chip global and on-chip scratchpad memory
Decisions on
- What data elements to move in and out of scratchpad
- When to move data
- How to move data
- How to access the data elements copied to scratchpad

10
1. Allocation of storage space (as arrays) in the scratchpad memory for local copies
2. Determination of access functions of arrays in scratchpad memories
3. Generation of code for moving data between scratchpad (local) and off-chip (global) memories

11
Targeted at affine programs
- Dense arrays
- Loop bounds: affine functions of outer loop variables, constants, and program parameters
- Array access functions: affine functions of surrounding loop variables, constants, and program parameters
Developed using the polyhedral model
- An algebraic framework for representing affine programs (statement domains, dependences, array access functions) and affine program transformations

12
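Polyhedral Model

    for (i = 1; i <= 4; i++)
        for (j = 2; j <= 4; j++)
    S1:     a[i][j] = a[j][i] + a[i][j-1];

Iteration vector and iteration space of S1:
\[ x_{S1} = \begin{pmatrix} i \\ j \end{pmatrix}, \qquad
   \mathcal{I}_{S1}: \begin{pmatrix} 1 & 0 & -1 \\ -1 & 0 & 4 \\ 0 & 1 & -2 \\ 0 & -1 & 4 \end{pmatrix}
   \begin{pmatrix} i \\ j \\ 1 \end{pmatrix} \ge 0 \]

Access functions of the three references to a (a[i][j], a[j][i], a[i][j-1]):
\[ \mathcal{F}_{1a}(x_{S1}) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \end{pmatrix} \]
\[ \mathcal{F}_{2a}(x_{S1}) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \end{pmatrix} \]
\[ \mathcal{F}_{3a}(x_{S1}) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} 0 \\ -1 \end{pmatrix} \]

Data space accessed by a reference (shown for the first one):
\[ DS_{1a} = \mathcal{F}_{1a} \, \mathcal{I}_{S1} \]

[Figure: iteration space of S1 in the (i, j) plane, bounded by i >= 1, i <= 4, j >= 2, j <= 4; corner labels (0,0) and (m,m)]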
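The representation above can be checked mechanically. The following is a minimal sketch, not the authors' implementation: it stores the four domain constraints of S1 as rows (a, b, c) meaning a*i + b*j + c >= 0 and the three access functions as a 2x2 matrix plus an offset. The names DOMAIN_S1, access_fn, in_domain_s1, apply_access and count_domain_points are assumptions made for this illustration.

```c
#include <assert.h>

/* Each row (a, b, c) encodes the affine constraint a*i + b*j + c >= 0. */
static const int DOMAIN_S1[4][3] = {
    { 1,  0, -1},   /* i >= 1 */
    {-1,  0,  4},   /* i <= 4 */
    { 0,  1, -2},   /* j >= 2 */
    { 0, -1,  4},   /* j <= 4 */
};

/* Affine access function: index = m * (i, j)^T + off. (Hypothetical type.) */
typedef struct { int m[2][2]; int off[2]; } access_fn;

const access_fn F1A = {{{1, 0}, {0, 1}}, {0,  0}};  /* a[i][j]   */
const access_fn F2A = {{{0, 1}, {1, 0}}, {0,  0}};  /* a[j][i]   */
const access_fn F3A = {{{1, 0}, {0, 1}}, {0, -1}};  /* a[i][j-1] */

/* Returns 1 if iteration (i, j) satisfies every constraint of S1. */
int in_domain_s1(int i, int j)
{
    for (int r = 0; r < 4; r++)
        if (DOMAIN_S1[r][0] * i + DOMAIN_S1[r][1] * j + DOMAIN_S1[r][2] < 0)
            return 0;
    return 1;
}

/* Maps iteration (i, j) to the array element touched by reference f. */
void apply_access(const access_fn *f, int i, int j, int out[2])
{
    out[0] = f->m[0][0] * i + f->m[0][1] * j + f->off[0];
    out[1] = f->m[1][0] * i + f->m[1][1] * j + f->off[1];
}

/* Counts the integer points of the iteration space by brute force. */
int count_domain_points(void)
{
    int n = 0;
    for (int i = 0; i <= 5; i++)
        for (int j = 0; j <= 5; j++)
            n += in_domain_s1(i, j);
    return n;
}
```

Applying an access function to every point of the iteration space yields the data space of that reference, which is what the allocation step operates on.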

13
Given a program block, identify the storage space needed for each non-overlapping accessed region of all arrays
- Access functions of array references may be non-uniformly generated
For architectures (e.g. NVIDIA GeForce GPU) supporting direct data access from off-chip memory
- Estimate extent of reuse of data to determine whether or not to copy to scratchpad

14
Algorithm and Illustration

Algorithm
- Find the set of all data spaces accessed by all references to an array, using
  - the access function of the reference
  - the iteration space of the statement that holds the reference
- Partition the set of all data spaces into maximal disjoint (non-overlapping) subsets of data spaces
- Find the bounding box of each partition of data spaces
- Allocate one local memory array for each bounding box

Illustration

    for (i = 10; i <= 14; i++) {
        for (j = 10; j <= 14; j++) {
            A[i][j+1] = A[i+j][j+1] * 3;
            for (k = 11; k <= 20; k++)
                B[i][j+k] = A[i][k] + B[i+j][k];
        }
    }

Local array LA0: lb(i) = 10; ub(i) = 14; lb(j) = 11; ub(j) = 20
Local array LA1: lb(i) = 20; ub(i) = 28; lb(j) = 11; ub(j) = 15

[Figure: array A with the two accessed regions; rows 10 to 14 and 20 to 28, columns 11 to 20]
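The steps above can be sketched for array A of the illustration. This is a simplified sketch under stated assumptions: data spaces are obtained by enumerating the iteration space rather than by polyhedral operations, and two data spaces are treated as overlapping when their bounding boxes overlap. The names box, box_extend, box_overlap and local_arrays_for_A are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Bounding box of a 2-D data space: lb/ub per array dimension. */
typedef struct { int lb[2], ub[2]; } box;

static box box_empty(void)
{
    box b = {{INT_MAX, INT_MAX}, {INT_MIN, INT_MIN}};
    return b;
}

static void box_extend(box *b, int r, int c)
{
    if (r < b->lb[0]) b->lb[0] = r;
    if (r > b->ub[0]) b->ub[0] = r;
    if (c < b->lb[1]) b->lb[1] = c;
    if (c > b->ub[1]) b->ub[1] = c;
}

static int box_overlap(const box *x, const box *y)
{
    return x->lb[0] <= y->ub[0] && y->lb[0] <= x->ub[0] &&
           x->lb[1] <= y->ub[1] && y->lb[1] <= x->ub[1];
}

/* Computes one bounding box per local array of A; returns how many. */
int local_arrays_for_A(box out[3])
{
    int n = 3;
    for (int r = 0; r < 3; r++)
        out[r] = box_empty();

    /* Step 1: data space of each reference, by enumerating iterations. */
    for (int i = 10; i <= 14; i++)
        for (int j = 10; j <= 14; j++) {
            box_extend(&out[0], i, j + 1);       /* A[i][j+1]   (write) */
            box_extend(&out[1], i + j, j + 1);   /* A[i+j][j+1] (read)  */
            for (int k = 11; k <= 20; k++)
                box_extend(&out[2], i, k);       /* A[i][k]     (read)  */
        }

    /* Steps 2 and 3: merge overlapping boxes until all are disjoint. */
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int p = 0; p < n && !changed; p++)
            for (int q = p + 1; q < n && !changed; q++)
                if (box_overlap(&out[p], &out[q])) {
                    box_extend(&out[p], out[q].lb[0], out[q].lb[1]);
                    box_extend(&out[p], out[q].ub[0], out[q].ub[1]);
                    out[q] = out[n - 1];   /* drop the merged box */
                    n--;
                    changed = 1;
                }
    }
    return n;
}
```

For the example this yields two boxes, matching LA0 (rows 10 to 14, columns 11 to 20) and LA1 (rows 20 to 28, columns 11 to 15).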

15
Array dimension in scratchpad may be lower than original array dimension, depending on accessed data
Access function in local memory array
- Original access function, or reduced access function, with offsets: the lower bounds (in each dimension) of the scratchpad array

Original code:

    for (i = 10; i <= 14; i++) {
        for (j = 10; j <= 14; j++) {
            A[i][j+1] = A[i+j][j+1] * 3;
            for (k = 11; k <= 20; k++)
                B[i][j+k] = A[i][k] + B[i+j][k];
        }
    }

Code using the scratchpad arrays:

    for (i = 10; i <= 14; i++) {
        for (j = 10; j <= 14; j++) {
            LA0[i-10][j+1-11] = LA1[i+j-20][j+1-11] * 3;
            for (k = 11; k <= 20; k++)
                LB0[i-10][j+k-21] = LA0[i-10][k-11] + LB1[i+j-20][k-11];
        }
    }

16
Generation of loop structure
- Scanning of polytopes (using CLooG, a tool for code generation) corresponding to data spaces of
  - read references: for moving data into scratchpad
  - write references: for moving data out of scratchpad
Generation of loop body (data movement statement)
- Copy from a location in scratchpad buffer to off-chip memory location or vice versa

    /* Data move-in code */
    for (i = 10; i <= 14; i++) {
        for (j = 11; j <= 20; j++)
            LA0[i-10][j-11] = A[i][j];
    }
    for (i = 20; i <= 28; i++) {
        for (j = max(i-13, 11); j <= min(15, i-9); j++)
            LA1[i-20][j-11] = A[i][j];
    }

    /* Data move-out code */
    for (i = 10; i <= 14; i++) {
        for (j = 11; j <= 15; j++)
            A[i][j] = LA0[i-10][j-11];
    }
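As a sanity check of the move-in, compute, move-out scheme for the running example, the sketch below runs the kernel once directly on A and B and once through local arrays, then compares the results. The move code for A follows the slide. The move code for B (move-in of LB1, move-out of LB0) is not shown on the slides and is derived here by the same reasoning, so treat it as an assumption. All function names and the initial values are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define AR 29   /* A is indexed up to A[28][20] */
#define AC 21
#define BR 29   /* B is indexed up to B[28][20] and B[14][34] */
#define BC 35

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

static void init(int A[AR][AC], int B[BR][BC])
{
    for (int r = 0; r < AR; r++)
        for (int c = 0; c < AC; c++) A[r][c] = 100 * r + c;
    for (int r = 0; r < BR; r++)
        for (int c = 0; c < BC; c++) B[r][c] = 7 * r - c;
}

static void run_original(int A[AR][AC], int B[BR][BC])
{
    for (int i = 10; i <= 14; i++)
        for (int j = 10; j <= 14; j++) {
            A[i][j+1] = A[i+j][j+1] * 3;
            for (int k = 11; k <= 20; k++)
                B[i][j+k] = A[i][k] + B[i+j][k];
        }
}

static void run_with_scratchpad(int A[AR][AC], int B[BR][BC])
{
    int LA0[5][10], LA1[9][5], LB0[5][14], LB1[9][10];

    /* Move-in: data spaces of the read references. */
    for (int i = 10; i <= 14; i++)
        for (int j = 11; j <= 20; j++) LA0[i-10][j-11] = A[i][j];
    for (int i = 20; i <= 28; i++)
        for (int j = imax(i-13, 11); j <= imin(15, i-9); j++)
            LA1[i-20][j-11] = A[i][j];
    for (int i = 20; i <= 28; i++)          /* assumed: B[i+j][k] reads */
        for (int j = 11; j <= 20; j++) LB1[i-20][j-11] = B[i][j];

    /* Compute on local arrays only. */
    for (int i = 10; i <= 14; i++)
        for (int j = 10; j <= 14; j++) {
            LA0[i-10][j+1-11] = LA1[i+j-20][j+1-11] * 3;
            for (int k = 11; k <= 20; k++)
                LB0[i-10][j+k-21] = LA0[i-10][k-11] + LB1[i+j-20][k-11];
        }

    /* Move-out: data spaces of the write references. */
    for (int i = 10; i <= 14; i++)
        for (int j = 11; j <= 15; j++) A[i][j] = LA0[i-10][j-11];
    for (int i = 10; i <= 14; i++)          /* assumed: B[i][j+k] writes */
        for (int j = 21; j <= 34; j++) B[i][j] = LB0[i-10][j-21];
}

/* Returns 1 if both versions leave A and B in the same state. */
int scratchpad_matches_original(void)
{
    static int A1[AR][AC], B1[BR][BC], A2[AR][AC], B2[BR][BC];
    init(A1, B1);
    init(A2, B2);
    run_original(A1, B1);
    run_with_scratchpad(A2, B2);
    return memcmp(A1, A2, sizeof A1) == 0 && memcmp(B1, B2, sizeof B1) == 0;
}
```

Only the parallelogram-shaped part of LA1 is moved in, which is sufficient because the kernel reads exactly those elements.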

17
Next section: Multi-level Tiling

18
GPU architecture: architectural components
- Slow off-chip (global) memory
- Two levels of parallelism
  - Set of multiprocessors
  - Set of processor cores in each multiprocessor
- Scratchpad on each multiprocessor, shared by its processor cores
[Figure: GPU architecture diagram; labels: Off-chip memory, Scratchpad]

19
Tiling transformation framework recently developed at OSU by Bondhugula (CC-08, PLDI-08)
- Finds tiling transformations (hyperplanes) for sequences of imperfectly nested loops
- Enables communication-minimal parallelization and locality optimization
- Identifies loops to tile for parallelism and data locality
Multiple levels of tiling for exploiting parallelism across multiple parallel levels
Additional (sequential) tiling at each level that has a scratchpad memory
- Applied if the data required by a tile executing at that level exceeds the scratchpad capacity
- Data movement at the start and end of each sequential tile
- Synchronization points to ensure consistency

20
Original loop nest:

    FORALL i = 1, Ni
      FORALL j = 1, Nj
        FOR k = 1, WS
          FOR l = 1, WS
            S1
          END FOR
        END FOR
      END FORALL
    END FORALL

Multi-level tiled loop nest:

    // Tiling to distribute at the outer level
    FORALL iT = 1, Ni, Ti
      FORALL jT = 1, Nj, Tj
        // Tiling to satisfy scratchpad memory limit
        FOR i' = iT, min(iT+Ti-1,Ni), ti'
          FOR j' = jT, min(jT+Tj-1,Nj), tj'
            FOR k' = 1, WS, tk'
              FOR l' = 1, WS, tl'
                // Tiling to distribute at the inner level
                FORALL it = i', min(i'+ti'-1,Ni), ti
                  FORALL jt = j', min(j'+tj'-1,Nj), tj
                    FOR i = it, min(it+ti-1,Ni)
                      FOR j = jt, min(jt+tj-1,Nj)
                        FOR k = k', min(k'+tk'-1,WS)
                          FOR l = l', min(l'+tl'-1,WS)
                            S1
                          END FOR
                        END FOR
                      END FOR
                    END FOR
                  END FORALL
                END FORALL
              END FOR
            END FOR
          END FOR
        END FOR
      END FORALL
    END FORALL
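A quick way to convince oneself that such a tiled nest is a legal reordering is to check that it still visits every iteration exactly once. The sketch below does this in sequential C for the i and j loops only, with the same three levels (outer tile, scratchpad tile, inner tile). It assumes that each tile size divides the enclosing one, since the inner bounds are clamped only to the problem size, as in the pseudocode. The problem sizes and the function name tiled_visit_is_exact are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define NI 37   /* hypothetical sizes, deliberately not multiples of the tiles */
#define NJ 23

static int min_int(int a, int b) { return a < b ? a : b; }

/* Ti,Tj: outer tiles; tpi,tpj: scratchpad tiles (i', j'); ti,tj: inner tiles.
   Returns 1 if every point of [1..NI] x [1..NJ] is visited exactly once. */
int tiled_visit_is_exact(int Ti, int Tj, int tpi, int tpj, int ti, int tj)
{
    static int visits[NI + 1][NJ + 1];

    assert(Ti > 0 && Tj > 0 && tpi > 0 && tpj > 0 && ti > 0 && tj > 0);
    assert(Ti % tpi == 0 && Tj % tpj == 0);   /* assumed divisibility */
    assert(tpi % ti == 0 && tpj % tj == 0);
    memset(visits, 0, sizeof visits);

    for (int iT = 1; iT <= NI; iT += Ti)                   /* outer level */
      for (int jT = 1; jT <= NJ; jT += Tj)
        for (int ip = iT; ip <= min_int(iT + Ti - 1, NI); ip += tpi)   /* memory limit */
          for (int jp = jT; jp <= min_int(jT + Tj - 1, NJ); jp += tpj)
            for (int it = ip; it <= min_int(ip + tpi - 1, NI); it += ti)   /* inner level */
              for (int jt = jp; jt <= min_int(jp + tpj - 1, NJ); jt += tj)
                for (int i = it; i <= min_int(it + ti - 1, NI); i++)
                  for (int j = jt; j <= min_int(jt + tj - 1, NJ); j++)
                    visits[i][j]++;

    for (int i = 1; i <= NI; i++)
        for (int j = 1; j <= NJ; j++)
            if (visits[i][j] != 1)
                return 0;
    return 1;
}
```

On a GPU the two FORALL levels would map to multiprocessors and to the cores within a multiprocessor; here they are simply run sequentially.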

21
Handling scratchpad memory constraints
Cost model for data movement: C = N x (S + (V x L) / P)
- N: number of data movements
- S: sync cost per data movement
- V: number of elements per data movement (based on tile sizes)
- L: cost to transfer one element
- P: number of processes involved in data movement
Tile size search formulation
- Constraint: memory requirement within limit
- Objective function: minimize data movement cost, C
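The cost model is a direct formula, so a small helper makes it easy to experiment with. This is only a sketch: the function name, parameter types and the numbers in the usage note are illustrative assumptions, not values from the presentation.

```c
#include <assert.h>

/* C = N x (S + (V x L) / P)
   n: number of data movements, s: sync cost per data movement,
   v: elements per data movement, l: cost to transfer one element,
   p: number of processes sharing each data movement. */
double data_movement_cost(long n, double s, long v, double l, int p)
{
    assert(n >= 0 && v >= 0 && p > 0);
    return (double)n * (s + ((double)v * l) / (double)p);
}
```

For instance, 10 movements of 64 elements each, with a sync cost of 5, a per-element cost of 2 and 8 processes, give 10 x (5 + 64 x 2 / 8) = 210 cost units. Larger tiles raise V but lower N, which is the trade-off the tile size search explores.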

22
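Loop nest of m loops with tile sizes t_1, t_2, .., t_m
nl local arrays
- M_j: memory (as a function of tile sizes) for local array j
- V_inj and V_outj: volume (as a function of tile sizes) moved in to and out of local array memory j, respectively
- r_j: position in the loop nest where the data movement code of array j is placed
- M_up: total scratchpad memory
Variables: t_1, t_2, .., t_m
Memory constraint: the local arrays must fit in the scratchpad together,
\[ \sum_{j=1}^{nl} M_j \le M_{up} \]
Objective function: minimize the total data movement cost, i.e. the cost C of the cost model applied to the move-in volume V_inj and move-out volume V_outj of every local array j, where the number of movements of array j is determined by its placement r_j and the tile sizes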
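To make the formulation concrete, here is a minimal exhaustive search over power-of-two tile sizes for a purely hypothetical two-loop kernel. The assumptions are loud ones: a single local array with M(t1, t2) = t1 x t2 elements, one move-in of V = t1 x t2 elements per tile, no move-out, and N equal to the number of tiles. None of these come from the presentation, and the actual framework may search differently. The names tile_choice and search_tile_sizes are made up for this sketch.

```c
#include <assert.h>

typedef struct { int t1, t2; double cost; } tile_choice;

static long ceil_div(long a, long b) { return (a + b - 1) / b; }

/* Exhaustive search over power-of-two tile sizes (t1, t2).
   n1, n2: loop extents; m_up: scratchpad capacity in elements;
   s, l, p: sync cost, per-element cost, processes (as in the cost model). */
tile_choice search_tile_sizes(long n1, long n2, long m_up,
                              double s, double l, int p)
{
    tile_choice best = {0, 0, -1.0};   /* cost < 0 means "nothing found yet" */
    assert(n1 > 0 && n2 > 0 && m_up > 0 && p > 0);

    for (long t1 = 1; t1 <= n1; t1 *= 2)
        for (long t2 = 1; t2 <= n2; t2 *= 2) {
            long mem = t1 * t2;                 /* hypothetical M(t1, t2) */
            if (mem > m_up)
                continue;                       /* violates memory constraint */
            long n = ceil_div(n1, t1) * ceil_div(n2, t2);   /* movements */
            long v = t1 * t2;                   /* elements per movement */
            double c = (double)n * (s + ((double)v * l) / (double)p);
            if (best.cost < 0.0 || c < best.cost) {
                best.t1 = (int)t1;
                best.t2 = (int)t2;
                best.cost = c;
            }
        }
    return best;
}
```

With this toy model the total volume moved is roughly constant, so the search favours the largest tile that fits, because it minimizes the number of synchronizations. A real kernel with reuse across tiles would shift that balance.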

23
Next section: Experiments

24
Motion Estimation Kernel (1/2)
Machine information: NVIDIA GeForce 8800 GTX
- 16 x 8 cores @ 1.35 GHz
- 768 MB off-chip memory
- 16 x 16 KB scratchpad

25
1D Jacobi Kernel (1/2)
Machine information: NVIDIA GeForce 8800 GTX
- 16 x 8 cores @ 1.35 GHz
- 768 MB off-chip memory
- 16 x 16 KB scratchpad

26
Machine information: NVIDIA GeForce 8800 GTX
- 16 x 8 cores @ 1.35 GHz
- 768 MB off-chip memory
- 16 x 16 KB scratchpad
Chart annotation: tile size from model

27
Machine information: NVIDIA GeForce 8800 GTX
- 16 x 8 cores @ 1.35 GHz
- 768 MB off-chip memory
- 16 x 16 KB scratchpad
Chart annotation: tile size from model

28
Next section: Related Work

29
Scratchpad memory management
- Data reuse: Issenin et al. [DAC06]
- Allocation for uniformly generated references
  - Schreiber and Cronquist [HPLTR04]
  - Anantharaman and Pande [RTSS98]
  - Kandemir et al. [CAD04]
Improving performance on cached architectures
- Ferrante et al. [LCPC92]
- Gallivan et al. [ICS88]
Multi-level tiling
- Fatahalian et al. [SC06]: various levels of memory
- Bikshandi et al. [PPOPP06] and Renganarayanan et al. [SC07, IPDPS07]: parallelism and locality

30
Next section: Summary

31
Addressed two issues in compiling for modern multi-level parallel architectures with scratchpads
1. Data management in scratchpad memory
   1. Data allocation
   2. Access in scratchpad
   3. Code generation for data movement
2. Mapping of computation in regular programs on to multiple levels of parallel units
Experimental evaluation using NVIDIA GPU

32
Next section: Ongoing and Future Work

33
Developing an end-to-end compiler framework for modern many-core architectures like GPUs
- Algorithms developed in this work: an integral part of the overall compiler framework
Further optimize transformations like tiling, for modern architectures like GPUs, using model-driven empirical search
