
1 Programming with Tiles
Jia Guo, Ganesh Bikshandi*, Basilio B. Fraguela+, Maria J. Garzaran, David Padua
University of Illinois at Urbana-Champaign; *IBM, India; +Universidade da Coruña, Spain

2 Motivation
The importance of tiles:
- A natural way to express many algorithms
- Partitioning data is an effective way to enhance locality
- Pervasive in parallel computing
Limitations of today's programming languages:
- No language construct to express tiles directly
- Complicated indexing and loop structures to traverse an array by tiles
- No support for storage by tile
- Design and low-level implementation details are mixed

3 Contributions
- Our group developed Hierarchically Tiled Arrays (HTAs), which support tiles to enhance locality and parallelism (PPoPP'06).
- Designed new language constructs based on our experience:
  - Dynamic partitioning
  - Overlapped tiling
- Evaluated both productivity and performance

4 Outline
- Introduction to HTA
- Dynamic partitioning
- Overlapped tiling
- Impact on programming productivity
- Related work
- Conclusions

5 HTA overview
[Figure: an HTA A, with examples of tile access A(0,0), a tile range A(1, 0:1), hybrid tile/element access A(0,1)[0,1], a tile-by-tile operation A + B, and tile assignment A(0, 0:1) = B(1, 0:1)]
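The two-level indexing above can be pictured with a small self-contained C++ sketch. The Tile/TiledArray classes below are hypothetical stand-ins, not the HTA library API; they only illustrate how A(0,1)[0,1] first selects a tile and then an element within it.

    // Hypothetical sketch of two-level (tile, then element) indexing.
    #include <cstdio>
    #include <vector>

    struct Tile {                                  // one dense block
        int rows, cols;
        std::vector<double> data;
        Tile(int r, int c) : rows(r), cols(c), data(r * c, 0.0) {}
        double& operator()(int i, int j) { return data[i * cols + j]; }
    };

    struct TiledArray {                            // an array of tiles
        int trows, tcols;
        std::vector<Tile> tiles;
        TiledArray(int tr, int tc, int r, int c)
            : trows(tr), tcols(tc), tiles(tr * tc, Tile(r, c)) {}
        Tile& operator()(int ti, int tj) { return tiles[ti * tcols + tj]; }
    };

    int main() {
        TiledArray A(2, 2, 2, 2);                  // 2 x 2 tiles of 2 x 2 elements
        A(0, 1)(0, 1) = 3.0;                       // like A(0,1)[0,1] on the slide
        std::printf("%g\n", A(0, 1)(0, 1));
    }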

6 HTA library implementation
- Implemented in MATLAB and in C++
- Supports sequential and parallel execution, on top of MPI and TBB
- Supports linear and non-linear data layouts
- Execution model: SPMD
- Programming model: single-threaded

7 Outline
- Introduction to HTA
- Dynamic partitioning
- Overlapped tiling
- Impact on programming productivity
- Related work
- Conclusions

8 Why dynamic partitioning?
- Some linear algebra algorithms add and remove partitions to sweep through matrices.
- Cache-oblivious algorithms create partitions following a divide-and-conquer strategy.

9 Syntax and semantics
Partitioning is defined by partition lines. Two methods manipulate them:
  void part( Tuple sourcePartition, Tuple offset );
  void rmPart( Tuple partition );
Example: A.part((1,1), (2,2)) adds partition lines (2,2) elements past partition (1,1); A.rmPart((1,1)) removes partition (1,1). NONE in a dimension leaves that dimension's partitioning fixed.
[Figure: partition lines of A before and after A.part((1,1),(2,2)) and A.rmPart((1,1))]
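The effect of part()/rmPart() can be modeled as inserting and erasing partition lines. A minimal 1-D sketch, assuming partition lines are kept as element offsets (hypothetical code, not the library's internal representation):

    #include <set>

    struct Partitioned1D {
        int n;                          // number of elements
        std::set<int> lines;            // start offset of each tile
        explicit Partitioned1D(int size) : n(size) { lines.insert(0); }
        // add a partition line 'offset' elements past the line at 'src'
        void part(int src, int offset) { lines.insert(src + offset); }
        // remove the partition line at element offset 'pos'
        void rmPart(int pos) { if (pos != 0) lines.erase(pos); }
    };

    // usage: Partitioned1D A(8); A.part(0, 3);  -> tiles [0,3) [3,8)
    //        A.rmPart(3);                       -> single tile [0,8) again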

10 LU algorithm with dynamic partitioning
In the loop, A.part((1,1), (nb, nb)) repartitions the matrix and A.rmPart((1,1)) merges the finished parts back.
[Figure: at the beginning of each iteration the matrix is split into done and partially updated regions; repartitioning exposes blocks A11, A12, A21, A22 of size nb, which are updated before the partition is removed at the end of the iteration]

11 LU algorithm represented in FLAME [van de Geijn et al.]
[Figure: the FLAME algorithm, with the update (UPD) step performed inside the loop]

12 From algorithm to HTA program
The FLAME algorithm maps to the following HTA implementation:

    void lu(HTA A, HTA p, int nb) {
      A.part((0,0), (0,0));                  // start with an empty leading partition
      p.part((0), (0));
      while (A(0,0).lsize(1) < A.lsize(1)) {
        int b = min(A(1,1).lsize(0), nb);
        A.part((1,1), (b,b));                // expose the b x b pivot block
        p.part((1), (b));
        dgetf2(A(1:2,1), p(1));              // factor the current panel
        dlaswp(A(1:2,0), p(1));              // apply the row swaps left and right
        dlaswp(A(1:2,2), p(1));
        trsm(HtaRight, HtaUpper, HtaNoTrans, HtaUnit, One, A(1,1), A(1,2));
        gemm(HtaNoTrans, HtaNoTrans, MinusOne, A(2,1), A(1,2), One, A(2,2));
        A.rmPart((1,1));                     // absorb the finished block
        p.rmPart((1));
      }
    }

13 FLAME API vs. HTA
The FLAME API operations map to:
    A.part((0,0), (0,0));
    A.part((1,1), (b,b));
    A.rmPart((1,1));
    A(1:2, 1);
Advantages of HTA: (1) general, (2) fewer variables, (3) simple indexing, (4) flexible range selection.

14 A cache-oblivious algorithm: parallel merge
Given two sorted sequences in1 and in2:
1. Take the middle element of in1 (e.g., 22).
2. Find its lower bound in in2 (e.g., just before 23).
3. Calculate the corresponding partition point in out.
4. Partition (logically) both inputs and the output.
5. Merge the two halves in parallel.
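Steps 1-5 map directly onto a recursive routine. A runnable C++ sketch of the divide-and-conquer merge follows; the two recursive calls are independent, which is what the parallel version exploits, though they run sequentially here:

    #include <algorithm>

    void coMerge(const double* a, int na, const double* b, int nb, double* out) {
        if (na < nb) { std::swap(a, b); std::swap(na, nb); } // keep a the longer run
        if (na == 0) return;
        if (na + nb <= 32) {                     // small case: plain serial merge
            std::merge(a, a + na, b, b + nb, out);
            return;
        }
        int ma = na / 2;                                      // 1. middle of in1
        int mb = std::lower_bound(b, b + nb, a[ma]) - b;      // 2. lower bound in in2
        coMerge(a, ma, b, mb, out);                  // 3.-5. the output splits at
        coMerge(a + ma, na - ma, b + mb, nb - mb,    //       ma + mb, and the two
                out + ma + mb);                      //       halves are independent
    }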

15 HTA vs. Threading Building Blocks (TBB)
- HTA: partition the tiles, then operate on the tiles in parallel:
    map(PMerge(), out, in1, in2);
- TBB: prepare the merge range, then operate on smaller ranges in parallel:
    parallel_for(PMergeRange(begin1, end1, begin2, end2, out), PMergeBody());
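The TBB version hinges on a custom range whose splitting constructor performs steps 1-4 of the previous slide. A sketch modeled on Intel's documented parallel-merge range pattern (the PMergeRange/PMergeBody names come from this slide; details such as the grain size are illustrative assumptions):

    #include <algorithm>
    #include <tbb/parallel_for.h>

    struct PMergeRange {
        double *begin1, *end1, *begin2, *end2, *out;
        static const long grainsize = 1000;
        PMergeRange(double* b1, double* e1, double* b2, double* e2, double* o)
            : begin1(b1), end1(e1), begin2(b2), end2(e2), out(o) {}
        bool empty() const { return (end1 - begin1) + (end2 - begin2) == 0; }
        bool is_divisible() const {
            return std::min(end1 - begin1, end2 - begin2) > grainsize;
        }
        PMergeRange(PMergeRange& r, tbb::split) {      // steps 1-4 of the slide
            if (r.end1 - r.begin1 < r.end2 - r.begin2) {
                std::swap(r.begin1, r.begin2);         // keep sequence 1 the longer one
                std::swap(r.end1, r.end2);
            }
            double* m1 = r.begin1 + (r.end1 - r.begin1) / 2;
            double* m2 = std::lower_bound(r.begin2, r.end2, *m1);
            begin1 = m1; end1 = r.end1;                // this new range: upper halves
            begin2 = m2; end2 = r.end2;
            out = r.out + (m1 - r.begin1) + (m2 - r.begin2);
            r.end1 = m1; r.end2 = m2;                  // r keeps the lower halves
        }
    };

    struct PMergeBody {
        void operator()(const PMergeRange& r) const {  // step 5: serial leaf merge
            std::merge(r.begin1, r.end1, r.begin2, r.end2, r.out);
        }
    };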

16 Evaluation

Benchmark      | Category                  | Execution mode | Platform
MMM            | Cache-oblivious algorithm | Sequential     | 3.0 GHz Intel Pentium 4, 16 KB L1, 1 MB L2, 1 GB RAM
Recursive LU   | Cache-oblivious algorithm | Sequential     | (same)
Dynamic LU     | FLAME's algorithm         | Sequential     | (same)
Sylvester      | FLAME's algorithm         | Sequential     | (same)
Parallel merge | Cache-oblivious algorithm | Parallel       | Two quad-core 2.66 GHz Xeon processors

17 [Performance plots: recursive MMM, recursive LU, dynamic LU, Sylvester]

18 Parallel merge
[Performance plot]

19 Outline
- Introduction to HTA
- Dynamic partitioning
- Overlapped tiling
- Impact on programming productivity
- Related work
- Conclusions

20 Motivation
- Programs with computations based on neighboring points (e.g., iterative PDE solvers) benefit from tiling.
- Tiling such codes requires shadow regions.
- Problem with the current approach: explicit allocation and update of the shadow regions.
- Example: NAS MG (lines of code):

Version | comm3 | Residual | Inversion | Projection
MPI     | 327   | 23       | 24        | 49
CAF     | 302   | 25       | 23        | 55
OpenMP  | 26    | 27       | 26        | 59
HTA     | 33    | 55       | 53        | 64

21 Overlapped tiling
Objective: automate the allocation and update of shadow regions.
- Allow access to neighbors across neighboring tiles.
- Allow the programmer to specify the overlapped region at creation time:
    Overlap ol = Overlap(Tuple negativeDir, Tuple positiveDir, BoundaryMode mode, bool autoUpdate = true);

22 Example of overlapped tiling
    Tuple::seq tiling = ((4), (3));
    Overlap ol((1), (1), zero);
    A = HTA::alloc(1, tiling, array, ROW, ol);
[Figure: each tile is allocated with a one-element overlap (shadow region) on each side; the boundary is filled with zeros]

23 Indexing and operations
- Operations (+, -, map, =, etc.): the conformability rules apply only to the owned regions, which enables operations with non-overlapped HTAs.
- Indexing: T[0:3] = T[ALL] selects the owned region; T[-1] and T[4] address the shadow cells beyond it.
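A sketch of how indices -1 and 4 can address shadow cells: the tile's buffer is allocated with the overlap on each side, and indexing shifts by the overlap depth (hypothetical HaloTile class, not the HTA implementation):

    #include <vector>

    struct HaloTile {
        int n;                                   // owned elements: indices 0..n-1
        std::vector<double> buf;                 // n + 2: one shadow cell per side
        explicit HaloTile(int size) : n(size), buf(size + 2, 0.0) {}
        double& operator[](int i) { return buf[i + 1]; }  // accepts i in [-1, n]
    };

    // usage: HaloTile t(4); t[-1] and t[4] are the shadow cells, while
    //        t[0]..t[3] (the slide's T[0:3] = T[ALL]) is the owned region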

24 Shadow region consistency
- Ensures that shadow regions are properly updated and consistent with the corresponding data in the owned tiles.
- Uses an update-on-read policy, bookkeeping the status of each tile.
- In SPMD mode, no communication is needed.
- Allows both manual and automatic updates.
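One way to picture the update-on-read policy is a per-tile staleness flag, as in this small 1-D sketch (illustrative only, not the library's actual bookkeeping):

    struct Tile1D {
        double owned[4];                      // owned region
        double leftShadow = 0.0;              // shadow copy of the neighbor's edge
        bool stale = true;                    // set when the left neighbor writes
        double readLeftShadow(const Tile1D& leftNeighbor) {
            if (stale) {                      // refresh only on the first read
                leftShadow = leftNeighbor.owned[3];   // after a neighboring write
                stale = false;
            }
            return leftShadow;
        }
    };

Refreshing only when a stale shadow cell is actually read is what keeps the number of updates, and hence communications in distributed execution, to a minimum.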

25 Evaluation

Benchmark | Description                        | Execution mode
3D Jacobi | Stencil computation on 6 neighbors | Sequential
NAS MG    | Multigrid V-cycle algorithm        | Parallel
NAS LU    | Navier-Stokes equation solver      | Parallel

Platform: a cluster of 128 nodes, each with two 2 GHz G5 processors and 4 GB of RAM, connected by Myrinet; we used one processor per node. The NAS code (Fortran + MPI) was compiled with g77 and the HTA code with g++ 3.3, with the -O3 flag in both cases.

26 MG comm3 in HTA without overlapped tiling

    void comm3(Grid& u) {
      int NX = u.shape().size()[0] - 1;
      int NY = u.shape().size()[1] - 1;
      int NZ = u.shape().size()[2] - 1;
      int nx = u(T(0,0,0)).shape().size()[0] - 1;
      int ny = u(T(0,0,0)).shape().size()[1] - 1;
      int nz = u(T(0,0,0)).shape().size()[2] - 1;
      // north-south
      Traits::Default::async();
      if (NX > 0)
        u((R(0, NX-1), R(0, NY), R(0, NZ)))((R(nx, nx), R(1, ny-1), R(1, nz-1))) =
          u((R(1, NX), R(0, NY), R(0, NZ)))((R(1,1), R(1, ny-1), R(1, nz-1)));
      u((R(NX, NX), R(0, NY), R(0, NZ)))((R(nx, nx), R(1, ny-1), R(1, nz-1))) =
        u((R(0,0), R(0, NY), R(0, NZ)))((R(1,1), R(1, ny-1), R(1, nz-1)));
      if (NX > 0)
        u((R(1, NX), R(0, NY), R(0, NZ)))((R(0,0), R(1, ny-1), R(1, nz-1))) =
          u((R(0, NX-1), R(0, NY), R(0, NZ)))((R(nx-1, nx-1), R(1, ny-1), R(1, nz-1)));
      u((R(0, 0), R(0, NY), R(0, NZ)))((R(0,0), R(1, ny-1), R(1, nz-1))) =
        u((R(NX, NX), R(0, NY), R(0, NZ)))((R(nx-1, nx-1), R(1, ny-1), R(1, nz-1)));
      Traits::Default::sync();
      // east-west
      Traits::Default::async();
      if (NY > 0)
        u((R(0, NX), R(0, NY-1), R(0, NZ)))((R(0, nx), R(ny, ny), R(1, nz-1))) =
          u((R(0, NX), R(1, NY), R(0, NZ)))((R(0, nx), R(1, 1), R(1, nz-1)));
      u((R(0, NX), R(NY, NY), R(0, NZ)))((R(0, nx), R(ny, ny), R(1, nz-1))) =
        u((R(0, NX), R(0,0), R(0, NZ)))((R(0, nx), R(1, 1), R(1, nz-1)));
      if (NY > 0)
        u((R(0, NX), R(1, NY), R(0, NZ)))((R(0, nx), R(0, 0), R(1, nz-1))) =
          u((R(0, NX), R(0, NY-1), R(0, NZ)))((R(0, nx), R(ny-1, ny-1), R(1, nz-1)));
      u((R(0, NX), R(0, 0), R(0, NZ)))((R(0, nx), R(0, 0), R(1, nz-1))) =
        u((R(0, NX), R(NY, NY), R(0, NZ)))((R(0, nx), R(ny-1, ny-1), R(1, nz-1)));
      Traits::Default::sync();
      // front-back
      Traits::Default::async();
      if (NZ > 0)
        u((R(0, NX), R(0, NY), R(0, NZ-1)))((R(0, nx), R(0, ny), R(nz, nz))) =
          u((R(0, NX), R(0, NY), R(1, NZ)))((R(0, nx), R(0, ny), R(1, 1)));
      u((R(0, NX), R(0, NY), R(NZ, NZ)))((R(0, nx), R(0, ny), R(nz, nz))) =
        u((R(0, NX), R(0, NY), R(0, 0)))((R(0, nx), R(0, ny), R(1, 1)));
      if (NZ > 0)
        u((R(0, NX), R(0, NY), R(1, NZ)))((R(0, nx), R(0, ny), R(0, 0))) =
          u((R(0, NX), R(0, NY), R(0, NZ-1)))((R(0, nx), R(0, ny), R(nz-1, nz-1)));
      u((R(0, NX), R(0, NY), R(0, 0)))((R(0, nx), R(0, ny), R(0, 0))) =
        u((R(0, NX), R(0, NY), R(NZ, NZ)))((R(0, nx), R(0, ny), R(nz-1, nz-1)));
      Traits::Default::sync();
    }

MG comm3 in HTA with overlapped tiling: the entire routine reduces to declaring the overlap at allocation time:

    Overlap* ol = new Overlap(T(1,1,1), T(1,1,1), PERIODIC);

27 MG comm3 in NAS (Fortran + MPI)

      subroutine comm3(u,n1,n2,n3,kk)
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer n1, n2, n3, kk
      double precision u(n1,n2,n3)
      integer axis
      if( .not. dead(kk) )then
         do axis = 1, 3
            if( nprocs .ne. 1 ) then
               call ready( axis, -1, kk )
               call ready( axis, +1, kk )
               call give3( axis, +1, u, n1, n2, n3, kk )
               call give3( axis, -1, u, n1, n2, n3, kk )
               call take3( axis, -1, u, n1, n2, n3 )
               call take3( axis, +1, u, n1, n2, n3 )
            else
               call comm1p( axis, u, n1, n2, n3, kk )
            endif
         enddo
      else
         call zero3(u,n1,n2,n3)
      endif
      return
      end

      subroutine give3( axis, dir, u, n1, n2, n3, k )
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer axis, dir, n1, n2, n3, k, ierr
      double precision u( n1, n2, n3 )
      integer i3, i2, i1, buff_len, buff_id
      buff_id = 2 + dir
      buff_len = 0
      if( axis .eq. 1 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i2=2,n2-1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id) = u( 2, i2, i3)
               enddo
            enddo
            call mpi_send(
     >         buff(1, buff_id), buff_len, dp_type,
     >         nbr( axis, dir, k ), msg_type(axis,dir),
     >         mpi_comm_world, ierr)
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i2=2,n2-1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id) = u( n1-1, i2, i3)
               enddo
            enddo
            call mpi_send(
     >         buff(1, buff_id), buff_len, dp_type,
     >         nbr( axis, dir, k ), msg_type(axis,dir),
     >         mpi_comm_world, ierr)
         endif
      endif
      if( axis .eq. 2 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id) = u( i1, 2, i3)
               enddo
            enddo
            call mpi_send(
     >         buff(1, buff_id), buff_len, dp_type,
     >         nbr( axis, dir, k ), msg_type(axis,dir),
     >         mpi_comm_world, ierr)
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id) = u( i1, n2-1, i3)
               enddo
            enddo
            call mpi_send(
     >         buff(1, buff_id), buff_len, dp_type,
     >         nbr( axis, dir, k ), msg_type(axis,dir),
     >         mpi_comm_world, ierr)
         endif
      endif
      if( axis .eq. 3 )then
         if( dir .eq. -1 )then
            do i2=1,n2
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id) = u( i1, i2, 2)
               enddo
            enddo
            call mpi_send(
     >         buff(1, buff_id), buff_len, dp_type,
     >         nbr( axis, dir, k ), msg_type(axis,dir),
     >         mpi_comm_world, ierr)
         else if( dir .eq. +1 ) then
            do i2=1,n2
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id) = u( i1, i2, n3-1)
               enddo
            enddo
            call mpi_send(
     >         buff(1, buff_id), buff_len, dp_type,
     >         nbr( axis, dir, k ), msg_type(axis,dir),
     >         mpi_comm_world, ierr)
         endif
      endif
      return
      end

      subroutine take3( axis, dir, u, n1, n2, n3 )
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer axis, dir, n1, n2, n3
      double precision u( n1, n2, n3 )
      integer buff_id, indx
      integer status(mpi_status_size), ierr
      integer i3, i2, i1
      call mpi_wait( msg_id( axis, dir, 1 ), status, ierr)
      buff_id = 3 + dir
      indx = 0
      if( axis .eq. 1 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i2=2,n2-1
                  indx = indx + 1
                  u(n1,i2,i3) = buff(indx, buff_id)
               enddo
            enddo
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i2=2,n2-1
                  indx = indx + 1
                  u(1,i2,i3) = buff(indx, buff_id)
               enddo
            enddo
         endif
      endif
      if( axis .eq. 2 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i1=1,n1
                  indx = indx + 1
                  u(i1,n2,i3) = buff(indx, buff_id)
               enddo
            enddo
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i1=1,n1
                  indx = indx + 1
                  u(i1,1,i3) = buff(indx, buff_id)
               enddo
            enddo
         endif
      endif
      if( axis .eq. 3 )then
         if( dir .eq. -1 )then
            do i2=1,n2
               do i1=1,n1
                  indx = indx + 1
                  u(i1,i2,n3) = buff(indx, buff_id)
               enddo
            enddo
         else if( dir .eq. +1 ) then
            do i2=1,n2
               do i1=1,n1
                  indx = indx + 1
                  u(i1,i2,1) = buff(indx, buff_id)
               enddo
            enddo
         endif
      endif
      return
      end

      subroutine ready( axis, dir, k )
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer axis, dir, k
      integer buff_id, buff_len, i, ierr
      buff_id = 3 + dir
      buff_len = nm2
      do i=1,nm2
         buff(i,buff_id) = 0.0D0
      enddo
      msg_id(axis,dir,1) = msg_type(axis,dir) + 1000*me
      call mpi_irecv( buff(1,buff_id), buff_len,
     >   dp_type, nbr(axis,-dir,k), msg_type(axis,dir),
     >   mpi_comm_world, msg_id(axis,dir,1), ierr)
      return
      end

28 [Performance plots: 3D Jacobi, NAS MG class C, NAS LU class B, NAS LU class C]

29 Outline
- Introduction to HTA
- Dynamic partitioning
- Overlapped tiling
- Impact on programming productivity
- Related work
- Conclusions

30 Three metrics
- Programming effort [Halstead, 1977]
  - Program volume V: a function of the number of distinct operators and operands and their total number of occurrences.
  - Potential volume V*: the volume of the most succinct form in a language that has the required operations defined or implemented.
- Program complexity [McCabe, 1976]: C = P + 1, where P is the number of decision points in the program.
- Source lines of code (LOC).
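For reference, Halstead's standard definitions (not spelled out on the slide): with eta1 distinct operators, eta2 distinct operands, and N1, N2 their total occurrences, the volume is V = (N1 + N2) * log2(eta1 + eta2); the program level is L = V*/V, and the effort is E = V/L = V^2/V*. A lower effort score in the table that follows therefore means the program is closer to its most succinct possible form.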

31 Evaluation

Program         | Effort  | Complexity | LOC
staticLU HTA    | 61,545  | 3          | 10
staticLU NDS    | 208,074 | 8          | 49
staticLU LAPACK | 160,509 | 6          | 37
dynamicLU HTA   | 51,599  | 1          | 13
dynamicLU FLAME | 170,477 | 1          | 52
recLU HTA       | 85,530  | 1          | 18
recLU ATLAS     | 186,891 | 10         | 40
Sylvester HTA   | 423,404 | 2          | 47
Sylvester FLAME | 700,629 | 5          | 95

32 Related work
- FLAME API [Bientinesi et al., 2005]: ad-hoc notation.
- Sequoia [Fatahalian et al., 2006]: the principal construct is the task.
- HPF [Hiranandani et al., 1992] and Co-Array Fortran [Numrich and Reid, 1998]: tiles are used for distribution; they do not address the different levels of the memory hierarchy.
- POOMA [Reynders, 1996]: tiles and shadow regions are accessed as a whole.
- Global Arrays [Nieplocha et al., 2006]: SPMD programming model.

33 Conclusion
- HTA makes tiles part of a language and provides a generalized framework to express them.
- It increases productivity: less index calculation, fewer variables and loops, simpler function interfaces.
- Dynamic partitioning and overlapped tiling extend this support to a wider class of algorithms.
- Little performance degradation.


35 Code example: 2D Jacobi
[Code comparison: without overlapped tiling vs. with overlapped tiling]
- HTA takes care of the allocation of shadow regions and of data consistency.
- Clean indexing syntax.
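A runnable stand-in for the 2D Jacobi kernel being compared; a plain array with a one-cell border plays the role of the shadow region that HTA manages automatically:

    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 6;                            // owned points per side
        std::vector<double> u((n + 2) * (n + 2), 0.0), v = u;
        auto at = [&](std::vector<double>& a, int i, int j) -> double& {
            return a[i * (n + 2) + j];
        };
        at(u, 3, 3) = 1.0;                          // some initial data
        for (int i = 1; i <= n; ++i)                // sweep the owned region only;
            for (int j = 1; j <= n; ++j)            // the border acts as the shadow
                at(v, i, j) = 0.25 * (at(u, i - 1, j) + at(u, i + 1, j) +
                                      at(u, i, j - 1) + at(u, i, j + 1));
        std::printf("%g\n", at(v, 2, 3));
    }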

36 Two types of stencil computation
- Concurrent computation: each tile can be executed independently (e.g., Jacobi, MG). Shadow regions are updated in every iteration.
- Wavefront computation: the tiles must execute in a certain order (e.g., LU, SSOR). Shadow regions are updated only when first read (here, in the second iteration).
With the update-on-read policy, the minimal number of communications is achieved.

