
1 Programming with CellSs BSC

2 ScicomP15, Cell tutorial, May 18th 2009
Outline
- StarSs programming model
- CellSs runtime
- CellSs syntax
- CellSs compiler
- Programming examples
- Performance analysis using Paraver
- Conclusions

3 Cell/B.E. Architecture
So, what is the Cell/B.E.?
- Users' point of view
- Architecture point of view: PPE and SPEs
- Programmers' point of view: separate address spaces, tiny local memory, bandwidth, thin processor, SMT, hard to optimize

4 StarSs programming model
Basic idea: start from a sequential application.

    ...
    for (i=0; i<N; i++) {
        T1 (data1, data2);
        T2 (data4, data5);
        T3 (data2, data5, data6);
        T4 (data7, data8);
        T5 (data6, data8, data9);
    }
    ...

- Task selection + parameters direction (input, output, inout)
- Task graph creation based on data precedence (T1_0, T2_0, T3_0, T4_0, T5_0, T1_1, T2_1, ..., T1_2, …)
- Scheduling, data transfer, task execution on Resource 1, Resource 2, Resource 3, ..., Resource N
- Synchronization, results transfer
Parallel resources: multicore, SMP, cluster, grid
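The task-graph step can be illustrated with a small sketch in plain C. It is not the StarSs runtime: the types `param_t` and `task_t` and the function `depends_on` are hypothetical names introduced here, and only true (read-after-write) dependences are detected, by comparing parameter addresses and declared directions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration only; not part of StarSs or CellSs. */
typedef enum { DIR_INPUT, DIR_OUTPUT, DIR_INOUT } direction_t;

typedef struct {
    const void *addr;   /* address of the data the parameter refers to */
    direction_t dir;    /* direction declared in the task annotation */
} param_t;

typedef struct {
    const param_t *params;
    size_t nparams;
} task_t;

static int writes(direction_t d) { return d == DIR_OUTPUT || d == DIR_INOUT; }
static int reads(direction_t d)  { return d == DIR_INPUT  || d == DIR_INOUT; }

/* Returns 1 if `later` reads data that `earlier` writes (a true dependence),
   which would add an edge earlier -> later to the task graph. */
int depends_on(const task_t *later, const task_t *earlier)
{
    assert(later != NULL && earlier != NULL);
    for (size_t i = 0; i < earlier->nparams; i++) {
        if (!writes(earlier->params[i].dir))
            continue;
        for (size_t j = 0; j < later->nparams; j++) {
            if (reads(later->params[j].dir) &&
                later->params[j].addr == earlier->params[i].addr)
                return 1;
        }
    }
    return 0;
}
```

Assuming T1 writes data2 and T3 reads it, `depends_on(&t3, &t1)` reports an edge, while T4, which touches only data7 and data8, stays independent of T1 and could run on another resource.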

5 CellSs: Syntax example - matrix multiply

    int main (int argc, char **argv) {
        int i, j, k;
        …
        initialize(A, B, C);
        for (i=0; i < NB; i++)
            for (j=0; j < NB; j++)
                for (k=0; k < NB; k++)
                    block_addmultiply (C[i][j], A[i][k], B[k][j]);
    }

    static void block_addmultiply (float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
        int i, j, k;
        for (i=0; i < BS; i++)
            for (j=0; j < BS; j++)
                for (k=0; k < BS; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }

[Figure: matrix of NB x NB blocks, each block of BS x BS elements]

6 CellSs: Syntax example - matrix multiply

    int main (int argc, char **argv) {
        int i, j, k;
        …
        initialize(A, B, C);
        for (i=0; i < NB; i++)
            for (j=0; j < NB; j++)
                for (k=0; k < NB; k++)
                    block_addmultiply( C[i][j], A[i][k], B[k][j]);
    }

    #pragma css task input(A, B) inout(C)
    static void block_addmultiply( float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
        int i, j, k;
        for (i=0; i < BS; i++)
            for (j=0; j < BS; j++)
                for (k=0; k < BS; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }

[Figure: matrix of NB x NB blocks, each block of BS x BS elements]

7 CellSs: Runtime
[Figure: structure of the CellSs runtime]
- PPE: user main program linked with the CellSs PPU lib
  - Main thread: data dependence, data renaming
  - Helper thread: scheduling, work assignment, synchronization
- SPE 0, SPE 1, SPE 2, ...: CellSs SPU lib plus the original task code; DMA in, task execution, DMA out, synchronization
- Memory: user data, renaming table, task control buffer
- Flows between them: tasks, stage in/out data, finalization signal

8 CellSs: Runtime - argument renaming
False dependences (WaW and WaR) are removed with dynamic renaming of arguments.

    for (i=0; i<N; i++) {
        T1 (…, …, block1);
        T2 (block1, …, block2[i]);
        T3 (block2[i], …, …);
    }

block1 is output from task T1 and input to task T2.
[Figure: task graph T1_1, T2_1, T3_1, T1_2, T2_2, T3_2, …, T1_N, T2_N, T3_N; all iterations share block1, which creates WaR and WaW dependences between iterations]

9 CellSs: Runtime - argument renaming
False dependences (WaW and WaR) are removed with dynamic renaming of arguments.

    for (i=0; i<N; i++) {
        T1 (…, …, block1);
        T2 (block1, …, block2[i]);
        T3 (block2[i], …, …);
    }

block1 is output from task T1 and input to task T2.
[Figure: the same task graph after renaming; iteration i works on its own copy block1_1, block1_2, …, block1_N, so the WaR and WaW dependences between iterations disappear]
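The renaming idea can be sketched with a small version table. This is an illustration under stated assumptions, not the CellSs renaming table: `rename_table_t`, `rename_input`, `rename_output` and the fixed capacity `MAX_ENTRIES` are hypothetical, and a "version" here simply stands for a fresh copy such as block1_1, block1_2, and so on.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ENTRIES 16  /* assumed small fixed capacity for the sketch */

/* Hypothetical renaming table: maps a user address to its latest version. */
typedef struct {
    const void *addr[MAX_ENTRIES];
    int version[MAX_ENTRIES];
    size_t count;
} rename_table_t;

/* Returns the table slot for addr, creating it with version 0 if needed. */
static size_t find_or_add(rename_table_t *t, const void *addr)
{
    for (size_t i = 0; i < t->count; i++)
        if (t->addr[i] == addr)
            return i;
    assert(t->count < MAX_ENTRIES);
    t->addr[t->count] = addr;
    t->version[t->count] = 0;
    return t->count++;
}

/* An input parameter reads the latest version produced so far. */
int rename_input(rename_table_t *t, const void *addr)
{
    return t->version[find_or_add(t, addr)];
}

/* An output parameter gets a fresh version, so earlier readers and writers
   of the old version no longer constrain this task (no WaR, no WaW). */
int rename_output(rename_table_t *t, const void *addr)
{
    size_t i = find_or_add(t, addr);
    return ++t->version[i];
}
```

With this scheme T1 of iteration 2 writes version 2 of block1 while T2 of iteration 1 may still be reading version 1, so only the true T1 to T2 dependence inside each iteration remains.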

10 CellSs: Runtime - scheduling
Scheduling strategy
- Critical path
- Locality
Kinds of bundle
- Bundle of dependent tasks: data locality in SPE
- Bundle of independent tasks
- Mixed bundle

11 CellSs: Runtime - scheduling
Scheduling for locality
- Ready lists (R_i): a higher subindex indicates higher priority according to memory locality
- Scheduling selects tasks from high-priority ready lists (higher "i")
Example:
    ReadyLocs(u) = {@A, @B}
    ReadyLocs(v) = {@C, @D}
    LocHints(SPE_j) = {@X, @Y, @B, @Z}
    LocHints(SPE_j+1) = {@U, @V, @W}
[Figure: tasks u and v placed in ready lists R_i and R_i+1, once per SPE]
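One way to read the example is as a count of how many of a task's ready locations already appear in an SPE's locality hints. The sketch below assumes exactly that; `loc_t` and `locality_score` are hypothetical names, and how the runtime really maps the score to a ready list R_i is not stated on the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Stands in for a main-memory address such as @A (assumption of this sketch). */
typedef unsigned long loc_t;

/* Counts how many ready locations of a task are present in the locality
   hints of one SPE. A higher count suggests a higher-priority ready list. */
size_t locality_score(const loc_t *ready_locs, size_t nready,
                      const loc_t *loc_hints, size_t nhints)
{
    size_t hits = 0;
    assert(ready_locs != NULL && loc_hints != NULL);
    for (size_t i = 0; i < nready; i++) {
        for (size_t j = 0; j < nhints; j++) {
            if (ready_locs[i] == loc_hints[j]) {
                hits++;
                break;  /* count each ready location at most once */
            }
        }
    }
    return hits;
}
```

For the values on the slide, u scores 1 against SPE_j (because of @B) and 0 against SPE_j+1, while v scores 0 against both, which is consistent with u being preferred over v on SPE_j.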

12 CellSs: Runtime - scheduling
"Co-parent" edges
- Co-parent edges are added between tasks that share a direct descendant
- Odep(u): number of outstanding dependences of task u outside the current bundle

13 CellSs: Runtime - scheduling
"Co-parent" edges
- Co-parent edges are added between tasks that share a direct descendant
- Maximum of two co-parent edges (due to implementation costs)

14 CellSs: Runtime
Paraver view of the runtime behavior
- Main thread: runs user code, and adds and removes tasks in the task graph
- Helper thread: schedules tasks and synchronizes with the SPEs
- SPEs: execute the tasks' code
Figure label: Bundle

15 CellSs: Runtime - specific SPE library features
Data dependence analysis, data renaming and task scheduling are performed in the CellSs PPE runtime library.
The CellSs SPE runtime library implements specific features that assist the CellSs PPE runtime library, but work independently of it:
- Early callback
- Minimal stage-out
- Software cache in the SPE Local Store
- Double buffering

16 CellSs: Runtime - specific SPE library features
Early callback
- Initially, completion of tasks is communicated on a per-bundle basis
- There are cases where this limits the application (task A in the example)
- An early callback after the limiting task enables the scheduling of new bundles
- Condition: the task has more than one outgoing dependence

17 CellSs: Runtime - specific SPE library features
Minimal stage-out
- For each task in a bundle, its outputs will be written back to main memory
- If a task inside the bundle rewrites the same output, there is no need to write the earlier value back to main memory
- Example: matmul, C[i][j] += A[i][k]*B[k][j]
- The case in the figure cannot happen, thanks to renaming
[Figure: bundle of tasks X, Y, X, Z; one task writes A, another reads A, and a later one writes A']
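The stage-out rule can be sketched as a pass over the outputs of a bundle. This is a simplified model, not the SPE library code: `minimal_stage_out` is a hypothetical helper, each task is assumed to have a single output, and addresses are plain integers.

```c
#include <assert.h>
#include <stddef.h>

/* out_addr[i] is the main-memory address written by the i-th task of the
   bundle, in execution order. needs_writeback[i] is set to 1 only when no
   later task in the same bundle writes that address again. */
void minimal_stage_out(const unsigned long *out_addr, size_t n,
                       int *needs_writeback)
{
    assert(out_addr != NULL && needs_writeback != NULL);
    for (size_t i = 0; i < n; i++) {
        needs_writeback[i] = 1;
        for (size_t j = i + 1; j < n; j++) {
            if (out_addr[j] == out_addr[i]) {
                needs_writeback[i] = 0;  /* overwritten later in the bundle */
                break;
            }
        }
    }
}
```

For three consecutive matmul tasks accumulating into the same C[i][j] followed by one task on another block, only the third and fourth outputs would be written back.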

18 CellSs: Runtime - specific SPE library features
Software cache in the SPE Local Store
- Maintained by the SPE runtime
- LRU replacement strategy
- PPE scheduling is not aware of this behavior
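As an illustration of an LRU replacement strategy, here is a minimal sketch of a cache of block slots. The slot count `CACHE_SLOTS`, the type `lru_cache_t` and both functions are assumptions made for this example; the real CellSs cache layout is not described on the slide.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SLOTS 4  /* assumed number of block slots in the Local Store */

typedef struct {
    unsigned long tag[CACHE_SLOTS];       /* main-memory address held by each slot */
    unsigned long last_use[CACHE_SLOTS];  /* logical time of the last access */
    int valid[CACHE_SLOTS];
    unsigned long clock;
} lru_cache_t;

void lru_init(lru_cache_t *c)
{
    assert(c != NULL);
    c->clock = 0;
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        c->tag[i] = 0;
        c->last_use[i] = 0;
        c->valid[i] = 0;
    }
}

/* Returns 1 on a hit and 0 on a miss. On a miss the block replaces a free
   slot if there is one, otherwise the least recently used slot. */
int lru_access(lru_cache_t *c, unsigned long addr)
{
    size_t victim = 0;
    assert(c != NULL);
    c->clock++;
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (c->valid[i] && c->tag[i] == addr) {
            c->last_use[i] = c->clock;
            return 1;
        }
    }
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (!c->valid[i]) { victim = i; break; }
        if (c->last_use[i] < c->last_use[victim]) victim = i;
    }
    c->tag[victim] = addr;
    c->valid[victim] = 1;
    c->last_use[victim] = c->clock;
    return 0;  /* a real runtime would start a DMA get for the block here */
}
```

A hit means the block is already in the Local Store and no transfer is needed, which is the data re-use the scheduler tries to encourage by bundling dependent tasks.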

19 CellSs: Runtime - specific SPE library features

    ...
    #pragma css task input(A, B) inout(C)
    block_addmultiply( C[i][j], A[i][k], B[k][j])

- For each operation, two blocks of data (A[i][k] and B[k][j]) are fetched from PPE memory into the SPE local store
- Clusters of dependent tasks are scheduled to the same SPE
- The inout block C[i][j] is kept in the local store and only put in PPE memory once (reuse)

20 CellSs: Runtime - specific SPE library features
Double buffering: CellSs overlaps DMA transfers with computations.
[Figure: SPE timeline for one bundle (task 1, task 2, ..., task N)]
- DMA programming: reading task control buffer
- Waiting for DMA transfer
- DMA programming: reading data
- Task execution overlapped with data transfers
- DMA programming: writing data
- Synchronization with helper thread

21 ScicomP15, Cell tutorial, May 18 th 2009 CellSs: Runtime - specific SPE library features Double buffering: paraver view SPE reads data SPE executes task

22 ScicomP15, Cell tutorial, May 18 th 2009 CellSs: Runtime - specific SPE library features Double buffering: paraver view DMA programming SPE waits for DMA in

23 ScicomP15, Cell tutorial, May 18 th 2009 CellSs: Runtime - specific SPE library features Double buffering: paraver view DMA out programming DMA in programming SPE waits for DMA in

24 CellSs: Runtime - specific SPE library features
Double buffering: Paraver view
- DMA out programming
- SPE waits for DMA out (all)

25 ScicomP15, Cell tutorial, May 18 th 2009 Break

26 CellSs: Syntax
Pragma syntax:

    #pragma css task [input ( )] \
                     [output ( )] \
                     [inout ( )] \
                     [highpriority]
    void task( ) {...

    #pragma css wait on( )
    #pragma css barrier
    #pragma css start
    #pragma css finish

27 CellSs: Syntax
Examples: task selection

    #pragma css task input(A, B) inout(C)
    void block_addmultiply( float C[N][N], float A[N][N], float B[N][N] ) {...

    #pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS])
    void block_addmultiply( float *C, float *A, float *B ) {...

    #pragma css task input(A[BS][BS], B[BS][BS], BS) inout(C[BS][BS])
    void block_addmultiply( float *C, float *A, float *B, int BS ) {...

Examples: waiting for data

    #pragma css task input (ref_block, to_comp) output (mse)
    void are_blocks_equal (float ref_block[BS][BS], float to_comp[BS][BS], float *mse) {...
    ...
    are_blocks_equal (X[ii][jj], Y[ii][jj], &sq_error);
    #pragma css wait on (sq_error)
    if (sq_error > 0.0000001)

28 CellSs: Syntax
Examples: synchronization

    for (i=0; i < NB; i++)
        for (j=0; j < NB; j++)
            for (k=0; k < NB; k++)
                block_addmultiply( C[i][j], A[i][k], B[k][j]);
    #pragma css barrier

Examples: prioritization

    #pragma css task input(lefthalo[32], tophalo[32], righthalo[32], \
                           bottomhalo[32]) inout(A[32][32]) highpriority
    void jacobi (float *lefthalo, float *tophalo, float *righthalo,
                 float *bottomhalo, float *A)
    {... }

29 CellSs: Syntax
Examples: CellSs program boundary

    #pragma css start
    for (i=0; i < NB; i++)
        for (j=0; j < NB; j++)
            for (k=0; k < NB; k++)
                block_addmultiply( C[i][j], A[i][k], B[k][j]);
    #pragma css finish

30 CellSs: Syntax in Fortran

    subroutine example()
    ...
    interface
       !$CSS TASK
       subroutine block_add_multiply(C, A, B, BS)
          implicit none
          integer, intent (in) :: BS
          real, intent (in) :: A(BS,BS), B(BS,BS)
          real, intent (inout) :: C(BS,BS)
       end subroutine
    end interface
    ...
    !$CSS START
    ...
    call block_add_multiply(C, A, B, BLOCK_SIZE)
    ...
    !$CSS FINISH
    ...
    end subroutine

    !$CSS TASK
    subroutine block_add_multiply(C, A, B, BS)
    ...
    end subroutine

31 CellSs compiler: Compiler phase
[Figure: compilation flow inside CELLSS-CC]
- Code translation (mcc): app.c into cellss-spu-cc_app.c, cellss-ppu-cc_app.c and app.tasks (tasks list)
- SPE compiler: cellss-spu-cc_app.c into cellss-spu-cc_app.o
- PPE compiler: cellss-ppu-cc_app.c into cellss-ppu-cc_app.o
- pack: the objects and app.tasks into app.o

32 CellSs compiler: Compiler phase
Files
- app.c: user code, with CellSs annotations
- cellss-spu-cc_app.c: specific code generated for the SPU (tasks code)
- cellss-ppu-cc_app.c: specific code generated for the PPU (main program)
- app.tasks: list of annotated tasks
Compilation steps
- mcc: Mercurium compiler (BSC), source-to-source compiler
- SPE compiler: generic SPE compiler (IBM SDK)
- PPE compiler: generic PPE compiler (IBM SDK)
- pack: specific CellSs module that combines objects (BSC)

33 CellSs compiler: Linker phase
[Figure: link flow inside CELLSS-CC. Components shown: unpack (app.o into cellss-spu-cc_app.o, cellss-ppu-cc_app.o and app.tasks); glue code generator (app.tasks into exec-adapters.c and exec-registration.c; the figure also labels app-adapters.c and app-adapters.cc); SPE compiler, SPE linker (with libCellSS-spu.a) and SPE embedder producing exec-spu and exec-spu.o; PPE compiler producing exec-adapters.o and exec-registration.o; PPE linker (with libCellSS.so) producing exec]

34 CellSs compiler: Linker phase
Files
- exec-adapters.c: code generated for each of the annotated tasks to call them uniformly ("stubs")
- exec-registration.c: code generated to register the annotated tasks
Linker steps
- unpack: unpacks objects
- glue code generator: from all the *.tasks files of an application, generates a single "adapters" file and a single "registration" file per executable
- SPE and PPE compilers and linkers, and SPE embedder (IBM SDK)

35 CellSs: Programming examples
Cholesky factorization
- Common matrix operation used to solve normal equations in linear least squares problems
- Calculates a triangular matrix (L) from a symmetric, positive definite matrix A: Cholesky(A) = L, with L · L^t = A
- Different implementations are possible, depending on how the matrix is traversed (by rows, by columns, left-looking, right-looking)
- It can be decomposed into block operations

36 CellSs: Programming examples
In each iteration the red and blue blocks are updated:
- SPOTRF: computes the Cholesky factorization of the diagonal block
- STRSM: computes the column panel
- SSYRK: computes the row panel
- SGEMM: updates the rest of the matrix
Figure label: block_syrk

37 CellSs: Programming examples
Cholesky factorization (DIM x DIM blocks of 64 x 64 elements)

    #pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
    void sgemm_tile(float *A, float *B, float *C)
    #pragma css task input (T[64][64]) inout(B[64][64])
    void strsm_tile(float *T, float *B)
    #pragma css task input(A[64][64]) inout(C[64][64])
    void ssyrk_tile(float *A, float *C)
    #pragma css task inout(A[64][64])
    void spotrf_tile(float *A)

    main () {
        ...
        for (i = 0; i < DIM; i++) {
            for (j = 0; j < i; j++) {
                for (k = 0; k < j; k++) {
                    sgemm_tile( A[i][k], A[j][k], A[i][j] );
                }
                strsm_tile( A[j][j], A[i][j] );
            }
            for (j = 0; j < i; j++) {
                ssyrk_tile( A[i][j], A[i][i] );
            }
            spotrf_tile( A[i][i] );
        }
        ...
        for (int i = 0; i < DIM; i++) {
            for (int j = 0; j < DIM; j++) {
                #pragma css wait on (A[i][j])
                print_block(A[i][j]);
            }
            ...
        }
    }

38 CellSs: Programming examples
Cholesky factorization (DIM x DIM blocks of 64 x 64 elements)

    #pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
    void sgemm_tile(float *A, float *B, float *C)
    #pragma css task input (T[64][64]) inout(B[64][64])
    void strsm_tile(float *T, float *B)
    #pragma css task input(A[64][64]) inout(C[64][64])
    void ssyrk_tile(float *A, float *C)
    #pragma css task inout(A[64][64])
    void spotrf_tile(float *A)

    main () {
        ...
        for (int j = 0; j < DIM; j++) {
            for (int k = 0; k < j; k++) {
                for (int i = j+1; i < DIM; i++) {
                    // A[i,j] = A[i,j] - A[i,k] * (A[j,k])^t
                    css_sgemm_tile( A[i][k], A[j][k], A[i][j] );
                }
            }
            for (int i = 0; i < j; i++) {
                // A[j,j] = A[j,j] - A[j,i] * (A[j,i])^t
                css_ssyrk_tile( A[j][i], A[j][j] );
            }
            // Cholesky factorization of A[j,j]
            css_spotrf_tile( A[j][j] );
            for (int i = j+1; i < DIM; i++) {
                // A[i,j] <- A[i,j] = X * (A[j,j])^t
                css_strsm_tile( A[j][j], A[i][j] );
            }
        }
        ...
        for (int i = 0; i < DIM; i++) {
            for (int j = 0; j < DIM; j++) {
                #pragma css wait on (A[i][j])
                print_block(A[i][j]);
            }
            ...
        }
    }

39 CellSs: Programming examples
Sparse LU
- More generic factorization than Cholesky; deals with non-symmetric matrices
- Calculates one lower triangular matrix (L) and one upper triangular matrix (U) whose product equals a row permutation of the original matrix: Perm(A) = L*U
- Difficult to program for Cell, since some operations work on columns (not blocks)
- The example shown here is a simplified version (without pivoting) based on an initial sparse matrix

40 CellSs: Programming examples
Sparse LU

    void lu0(float *diag);
    void bdiv(float *diag, float *row);
    void bmod(float *row, float *col, float *inner);
    void fwd(float *diag, float *col);

    int main(int argc, char **argv) {
        int ii, jj, kk;
        …
        for (kk=0; kk<NB; kk++) {
            lu0(A[kk][kk]);
            for (jj=kk+1; jj<NB; jj++)
                if (A[kk][jj] != NULL)
                    fwd(A[kk][kk], A[kk][jj]);
            for (ii=kk+1; ii<NB; ii++)
                if (A[ii][kk] != NULL) {
                    bdiv (A[kk][kk], A[ii][kk]);
                    for (jj=kk+1; jj<NB; jj++)
                        if (A[kk][jj] != NULL) {
                            if (A[ii][jj]==NULL)
                                A[ii][jj]=allocate_clean_block();
                            bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
                        }
                }
        }
    }

[Figure: sparse matrix of NB x NB blocks, each block of B x B elements]

41 CellSs: Programming examples
Sparse LU with task annotations

    #pragma css task inout(diag[B][B]) highpriority
    void lu0(float *diag);
    #pragma css task input(diag[B][B]) inout(row[B][B])
    void bdiv(float *diag, float *row);
    #pragma css task input(row[B][B],col[B][B]) inout(inner[B][B])
    void bmod(float *row, float *col, float *inner);
    #pragma css task input(diag[B][B]) inout(col[B][B])
    void fwd(float *diag, float *col);

    int main(int argc, char **argv) {
        int ii, jj, kk;
        …
        for (kk=0; kk<NB; kk++) {
            lu0(A[kk][kk]);
            for (jj=kk+1; jj<NB; jj++)
                if (A[kk][jj] != NULL)
                    fwd(A[kk][kk], A[kk][jj]);
            for (ii=kk+1; ii<NB; ii++)
                if (A[ii][kk] != NULL) {
                    bdiv (A[kk][kk], A[ii][kk]);
                    for (jj=kk+1; jj<NB; jj++)
                        if (A[kk][jj] != NULL) {
                            if (A[ii][jj]==NULL)
                                A[ii][jj]=allocate_clean_block();
                            bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
                        }
                }
        }
    }

Callouts on the slide:
- Data dependent parallelism (the tests against NULL)
- Dynamic main memory allocation (allocate_clean_block)

42 CellSs: Programming examples
Checking LU

    int main(int argc, char* argv[])
    {
        ...
        copy_mat (A, origA);
        LU (A);
        split_mat (A, L, U);
        clean_mat (A);
        sparse_matmult (L, U, A);
        compare_mat (origA, A);
    }

    #pragma css task input(Src) output(Dst)
    void copy_block (float Src[BS][BS], float Dst[BS][BS]);

    void copy_mat (float *Src, float *Dst)
    {
        ...
        for (ii=0; ii<NB; ii++)
            for (jj=0; jj<NB; jj++)
                ...
                copy_block(Src[ii][jj], block);
                ...
    }

    #pragma css task input(A) output(L,U)
    void split_block (float A[BS][BS], float L[BS][BS], float U[BS][BS]);

    void split_mat (float *LU[NB][NB], float *L[NB][NB], float *U[NB][NB])
    {
        ...
        for (ii=0; ii<NB; ii++)
            for (jj=0; jj<NB; jj++) {
                ...
                split_block (LU[ii][jj], L[ii][jj], U[ii][jj]);
                ...
            }
    }

43 CellSs: Programming examples
Checking LU

    int main(int argc, char* argv[])
    {
        ...
        copy_mat (A, origA);
        LU (A);
        split_mat (A, L, U);
        clean_mat (A);
        sparse_matmult (L, U, A);
        compare_mat (origA, A);
    }

    void clean_mat (p_block_t Src[NB][NB])
    {
        int ii, jj;
        for (ii=0; ii<NB; ii++)
            for (jj=0; jj<NB; jj++)
                if (Src[ii][jj] != NULL) {
                    free (Src[ii][jj]);
                    Src[ii][jj]=NULL;
                }
    }

    #pragma css task output(Dst)
    void clean_block (float Dst[BS][BS] );

    void clean_mat (p_block_t Src[NB][NB])
    {
        int ii, jj;
        for (ii=0; ii<NB; ii++)
            for (jj=0; jj<NB; jj++)
                if (Src[ii][jj] != NULL) {
                    clean_block(Src[ii][jj]);
                }
    }

44 CellSs: Programming examples
Checking LU

    int main(int argc, char* argv[])
    {
        ...
        copy_mat (A, origA);
        LU (A);
        split_mat (A, L, U);
        clean_mat (A);
        sparse_matmult (L, U, A);
        compare_mat (origA, A);
    }

    void sparse_matmult (float *A[NB][NB], float *B[NB][NB], float *C[NB][NB])
    {
        int ii, jj, kk;
        for (ii=0; ii<NB; ii++)
            for (jj=0; jj<NB; jj++)
                for (kk=0; kk<NB; kk++)
                    if ((A[ii][kk] != NULL) && (B[kk][jj] != NULL)) {
                        if (C[ii][jj] == NULL)
                            C[ii][jj] = allocate_clean_block();
                        block_matmul (A[ii][kk], B[kk][jj], C[ii][jj]);
                    }
    }

    #pragma css task input(a,b) inout(c)
    void block_matmul(float a[BS][BS], float b[BS][BS], float c[BS][BS])
    {
        int i, j, k;
        for (i=0; i<BS; i++)
            for (j=0; j<BS; j++)
                for (k=0; k<BS; k++)
                    c[i][j] += a[i][k]*b[k][j];
    }

45 CellSs: Programming examples
Checking LU

    int main(int argc, char* argv[])
    {
        ...
        copy_mat (A, origA);
        LU (A);
        split_mat (A, L, U);
        clean_mat (A);
        sparse_matmult (L, U, A);
        compare_mat (origA, A);
    }

    #pragma css task input (ref_block, to_comp) output (mse)
    void are_blocks_equal (float ref_block[BS][BS], float to_comp[BS][BS], float *mse);

    void compare_mat (p_block_t X[NB][NB], p_block_t Y[NB][NB], struct timeval *stop)
    {
        ...
        Zero_block = allocate_clean_block();
        for (ii = 0; ii < NB; ii++)
            for (jj = 0; jj < NB; jj++) {
                if (X[ii][jj] == NULL)
                    if (Y[ii][jj] == NULL)
                        sq_error[ii][jj] = 0.0f;
                    else
                        are_blocks_equal(Zero_block, Y[ii][jj], &sq_error[ii][jj]);
                else
                    are_blocks_equal(X[ii][jj], Y[ii][jj], &sq_error[ii][jj]);
            }
        #pragma css finish
        for (ii = 0; ii < NB; ii++)
            for (jj = 0; jj < NB; jj++)
                if (sq_error[ii][jj] > 0.0000001L) {
                    printf ("block [%d, %d]: detected mse = % 20lf\n", ii, jj, sq_error[ii][jj]);
                    some_difference = TRUE;
                }
        if (some_difference == FALSE)
            printf ("matrices are identical\n");
    }

46 CellSs: Programming examples
Checking LU

    int main(int argc, char* argv[])
    {
        ...
        copy_mat (A, origA);
        LU (A);
        split_mat (A, L, U);
        clean_mat (A);
        sparse_matmult (L, U, A);
        compare_mat (origA, A);
    }

    #pragma css task input (ref_block, to_comp) output (mse)
    void are_blocks_equal (float ref_block[BS][BS], float to_comp[BS][BS], float *mse);

    void compare_mat (p_block_t X[NB][NB], p_block_t Y[NB][NB], struct timeval *stop)
    {
        ...
        Zero_block = allocate_clean_block();
        for (ii = 0; ii < NB; ii++)
            for (jj = 0; jj < NB; jj++) {
                if (X[ii][jj] == NULL)
                    if (Y[ii][jj] == NULL)
                        sq_error[ii][jj] = 0.0f;
                    else
                        are_blocks_equal(Zero_block, Y[ii][jj], &sq_error[ii][jj]);
                else
                    are_blocks_equal(X[ii][jj], Y[ii][jj], &sq_error[ii][jj]);
            }
        for (ii = 0; ii < NB; ii++)
            for (jj = 0; jj < NB; jj++) {
                #pragma css wait on (&sq_error[ii][jj])
                if (sq_error[ii][jj] > 0.0000001L) {
                    printf ("block [%d, %d]: detected mse = % 20lf\n", ii, jj, sq_error[ii][jj]);
                    some_difference = TRUE;
                }
            }
        if (some_difference == FALSE)
            printf ("matrices are identical\n");
    }

47 ScicomP15, Cell tutorial, May 18 th 2009 CellSs: Programming examples copy_mat (A, origA); LU (A); split_mat (A, L, U); clean_mat(A); sparse_matmult (L, U, A); compare_mat (origA, A); Without CellSsWith CellSs (for NB=4 matrix)‏ Behavior Checking LU

48 ScicomP15, Cell tutorial, May 18 th 2009 CellSs: Programming examples Behavior Checking LU 0: are_blocks_equal 1: bdiv_adapte 2: block_mpy_add 3: bmod 4: clean_block 5: copy_block 6: fwd 7: lu0 8: split_block

49 CellSs: Programming examples
Molecular dynamics: Argon simulation
- Simulates the mobility of argon atoms in the gas state, in a constant volume at T=300K
- All electrostatic forces acting on each atom due to the others are considered (F_i)
- Newton's second law is then applied to each atom: F_i = m*a_i
- The initial velocities are random but reasonable for argon atoms at 300K
- To maintain a constant temperature throughout the process, the Berendsen algorithm is applied

50 CellSs: Programming examples
Molecular dynamics: Argon simulation

    program argon
    ...
    interface
       !$CSS TASK
       subroutine velocity(BSIZE, ii, jj, xi, yi, zi, xj, yj, zj, vx, vy, vz)
          implicit none
          integer, intent(in) :: BSIZE, ii, jj
          real, intent(in), dimension(BSIZE) :: xi, yi, zi, xj, yj, zj
          real, intent(inout), dimension(BSIZE) :: vx, vy, vz
       end subroutine
       !$CSS TASK
       subroutine update_position(BSIZE, lam1, vx, vy, vz, x, y, z)
          implicit none
          integer, intent(in) :: BSIZE
          real, intent(in) :: lam1
          real, intent(inout), dimension(BSIZE) :: vx, vy, vz
          real, intent(inout), dimension(BSIZE) :: x, y, z
       end subroutine
       !$CSS TASK
       subroutine v_mod(BSIZE, v, vx, vy, vz)
          implicit none
          integer, intent(in) :: BSIZE
          real, intent(out) :: v(BSIZE)
          real, intent(in), dimension(BSIZE) :: vx, vy, vz
       end subroutine
    end interface
    ...
    !$CSS START
    do step=1,niter
       do ii=1, N, BSIZE
          do jj=1, N, BSIZE
             call velocity(BSIZE, ii, jj, x(ii), y(ii), z(ii), x(jj), y(jj), z(jj), vx(ii), vy(ii), vz(ii))
          enddo
       enddo
       do jj=1, N, BSIZE
          call v_mod(BSIZE, v(jj), vx(jj), vy(jj), vz(jj))
       enddo
       !$CSS BARRIER
       tins=0.e0
       do i=1,N
          tins=mkg*v(i)**2/3.e0/kb+tins
       enddo
       tins=tins/N
       lam1=sqrt(t/tins)
       do ii=1, N, BSIZE
          call update_position(BSIZE, lam1, vx(ii), vy(ii), vz(ii), x(ii), y(ii), z(ii))
       enddo
    enddo
    !$CSS FINISH
    end program argon

51 CellSs: Programming examples
Molecular dynamics: Argon simulation

    program argon
    ...
    !$CSS START
    do step=1,niter
       do ii=1, N, BSIZE
          do jj=1, N, BSIZE
             call velocity(BSIZE, ii, jj, x(ii), y(ii), z(ii), x(jj), y(jj), z(jj), vx(ii), vy(ii), vz(ii))
          enddo
       enddo
       do jj=1, N, BSIZE
          call v_mod(BSIZE, v(jj), vx(jj), vy(jj), vz(jj))
       enddo
       !$CSS BARRIER
       tins=0.e0
       do i=1,N
          tins=mkg*v(i)**2/3.e0/kb+tins
       enddo
       tins=tins/N
       lam1=sqrt(t/tins)
       do ii=1, N, BSIZE
          call update_position(BSIZE, lam1, vx(ii), vy(ii), vz(ii), x(ii), y(ii), z(ii))
       enddo
    enddo
    !$CSS FINISH
    end

    !$CSS TASK
    subroutine velocity(BSIZE, ii, jj, xi, yi, zi, xj, yj, zj, vx, vy, vz)
    ! subroutine code
    end subroutine

    !$CSS TASK
    subroutine update_position(BSIZE, lam1, vx, vy, vz, x, y, z)
    ! subroutine code
    end subroutine

    !$CSS TASK
    subroutine v_mod(BSIZE, v, vx, vy, vz)
    ! subroutine code
    end subroutine

52 CellSs: Programming examples
Vector reduction
[Figure: array A of NB elements split into blocks of BS elements, reduced pairwise in a tree]

53 CellSs: Programming examples
Vector reduction

    int main(int argc, char* argv[])
    {
        LEVELS = log2 ((double)NB/BS);
        #pragma css start
        for (level = 0; level < LEVELS; level++) {
            range = exp2 ((double)level);
            for (i=0; i<NB; i+=2*BS*range)
                block_reduce(&A[i], &A[i+BS*range]);
        }
        block_reduce2(&A[0], &reduction);
        #pragma css finish
    }

    #pragma css task input(B[64*64]) inout(A[64*64])
    void block_reduce(float *A, float *B)
    {
        int i;
        for (i=0; i<BS; i++)
            A[i] += B[i];
    }

    #pragma css task input(A) output(x)
    void block_reduce2(float *A, float *x)
    {
        int i;
        *x = 0.0;
        for (i=0; i<BS; i++)
            *x += A[i];
    }
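The pairing pattern of the tree reduction can be checked with a sequential sketch that counts in blocks rather than elements. It is a stand-in written for this transcript, not CellSs code: `tree_reduce` and its parameters are hypothetical, the block count is assumed to be a power of two, and in CellSs each `block_reduce` call would be a task, with the calls inside one level independent of each other.

```c
#include <assert.h>
#include <stddef.h>

/* Adds block b into block a, element by element. */
static void block_reduce(float *a, const float *b, size_t bs)
{
    for (size_t i = 0; i < bs; i++)
        a[i] += b[i];
}

/* Reduces nblocks * bs floats to a single sum. nblocks must be a power of
   two. The array is overwritten with partial sums along the way. */
float tree_reduce(float *a, size_t nblocks, size_t bs)
{
    float sum = 0.0f;
    assert(a != NULL);
    assert(nblocks > 0 && (nblocks & (nblocks - 1)) == 0);

    /* One pass per tree level: the distance between partners doubles. */
    for (size_t range = 1; range < nblocks; range *= 2)
        for (size_t i = 0; i + range < nblocks; i += 2 * range)
            block_reduce(&a[i * bs], &a[(i + range) * bs], bs);

    /* Final step: collapse block 0 into a scalar, like block_reduce2. */
    for (size_t i = 0; i < bs; i++)
        sum += a[i];
    return sum;
}
```

With 4 blocks of 4 elements holding 1 to 16, the first level combines blocks (0,1) and (2,3), the second level combines (0,2), and the final sum is 136.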

54 CellSs: Programming examples
Vector reduction
[Figure: array A of NB elements split into blocks of BS elements, each accumulated in turn into a block initialized with the neutral element]
- Less concurrency for one vector
- Fine when considering several

55 CellSs: Programming examples
Vector reduction

    int main(int argc, char* argv[])
    {
        LEVELS = log2 ((double)NB/BS);
        #pragma css start
        for (i=0; i<NB; i+=BS)
            block_reduce(&RB[0], &A[i]);
        block_reduce2(&RB[0], &reduction);
        #pragma css finish
    }

    #pragma css task input(B[64*64]) inout(A[64*64])
    void block_reduce(float *A, float *B)
    {
        int i;
        for (i=0; i<BS; i++)
            A[i] += B[i];
    }

    #pragma css task input(A) output(x)
    void block_reduce2(float *A, float *x)
    {
        int i;
        *x = 0.0;
        for (i=0; i<BS; i++)
            *x += A[i];
    }

56 CellSs: Programming examples
- The current CellSs version cannot easily deal with non-contiguous data objects
- Also, because the Local Store (LS) is managed as a software cache, CellSs needs to control dynamic memory allocation in the SPU
CellSs offers an API to deal with:
- Explicit data transfers
- Dynamic memory allocation in the SPU

57 ScicomP15, Cell tutorial, May 18 th 2009 CellSs: Programming examples Dynamic memory allocation #include void *css_malloc (unsigned int size); void css_free (void *chunk);

58 CellSs: Programming examples
Example: dynamic memory allocation

    #pragma css task input(bs, log2_N, is_forward, twiddle) inout(data, sync)
    static void FFT1D_1 (int bs, int log2_N, float twiddle[CUBE_SIZE*2], int is_forward,
                         float data[bs][2*CUBE_SIZE], int sync[1])
    {
        FFT1D_core (bs, data, log2_N, twiddle, is_forward);
    }

    static void FFT1D_core (int bs, float data[bs][2*CUBE_SIZE], int log2_N,
                            float twiddle[CUBE_SIZE*2], int is_forward)
    {
        int i;
        int n_floats_elems = (1 << log2_N)*2;
        float *work_re = css_malloc(sizeof(float)*n_floats_elems);
        float *work_im = css_malloc(sizeof(float)*n_floats_elems);
        for (i=0; i<bs; i++)
            spe_FFT_1D_core (log2_N, &data[i][0], twiddle, is_forward, work_re, work_im);
        css_free((void *)work_re);
        css_free((void *)work_im);
    }

59 CellSs: Programming examples
DMA accesses
- CellSs handles all data transfers from main memory to the SPU Local Store
- Some applications may need to do explicit data transfers from main memory
For transfers of 1, 2, 4, 8 bytes or multiples of 16 bytes up to 16 KB:

    #include
    void css_get_a (void *ls, uint32_t ea, unsigned int dma_size, tagid_t tag);
    void css_put_a (void *ls, uint32_t ea, unsigned int dma_size, tagid_t tag);

- ls: pointer to a 16-byte aligned allocated buffer in LS
- ea: pointer to main memory
- dma_size: size of the block
- tag: identifier of the DMA transfer
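The size rule quoted above (1, 2, 4 or 8 bytes, or a multiple of 16 bytes up to 16 KB) can be written down as a small check. The helpers below are hypothetical and are not part of the CellSs API; they only encode the rule as stated on the slide.

```c
#include <assert.h>
#include <stddef.h>

#define DMA_MAX_BYTES (16u * 1024u)  /* 16 KB limit stated on the slide */

/* Returns 1 if `size` is 1, 2, 4 or 8 bytes, or a multiple of 16 bytes
   no larger than 16 KB; returns 0 otherwise. */
int dma_size_is_valid(size_t size)
{
    if (size == 1 || size == 2 || size == 4 || size == 8)
        return 1;
    return size >= 16 && size % 16 == 0 && size <= DMA_MAX_BYTES;
}

/* Number of transfers needed to move `total` bytes in pieces of at most
   16 KB, assuming `total` is itself a multiple of 16 bytes. */
size_t dma_transfer_count(size_t total)
{
    assert(total % 16 == 0);
    return (total + DMA_MAX_BYTES - 1) / DMA_MAX_BYTES;
}
```

A 64x64 block of floats is 16384 bytes, which is exactly the 16 KB limit, so it fits in a single transfer.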

60 CellSs: Programming examples
DMA accesses
Obtaining a tag: returns a valid tag for a DMA transfer

    tagid_t css_tag (void);

Synchronization

    void css_sync (tagid_t tag);

For DMA transfers not meeting the previous requirements

    void css_get (void *ls, unsigned int address, unsigned int dma_size, tagid_t tag);
    void css_put (void *ls, unsigned int address, unsigned int dma_size, tagid_t tag);

Example:

    float *blocks = (float *)css_malloc(N*sizeof(Complex));
    tag = css_tag ();
    css_get_a (blocks, addr, (unsigned int)(N*sizeof(Complex)), tag);
    css_sync(tag);

61 CellSs: Programming examples
Strided memory access
Interface to scatter/gather data from 1D, 2D and 3D arrays

    #include
    dmal_h_t *css_gather_1d (void *ls, unsigned int start, int chunk, int stride,
                             size_t size, size_t e_size, dmal_h_t *c_list);
    dmal_h_t *css_scatter_1d (void *ls, unsigned int start, int chunk, int stride,
                              size_t size, size_t e_size, dmal_h_t *c_list);

- ls: pointer to a 16-byte aligned allocated buffer in LS
- c_list: allows the same memory access pattern to be used again by reusing DMA lists
- size: number of objects to be copied
- e_size: size of one element
[Figure: 1D array showing start, chunk and stride]

62 CellSs: Programming examples
Strided memory access

    #include
    dmal_h_t *css_gather_2d (void *ls, unsigned int start, int local_x, int local_y,
                             int global_x, size_t e_size, dmal_h_t *c_list);
    dmal_h_t *css_scatter_2d (void *ls, unsigned int start, int local_x, int local_y,
                              int global_x, size_t e_size, dmal_h_t *c_list);

[Figure: 2D array of width global_x with a local_x by local_y sub-block beginning at start]
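The access pattern in the 2D figure can be mimicked on the host with plain copies. This is only a reference model of the pattern, under the assumption that `local_x` is the number of elements per block row, `local_y` the number of block rows and `global_x` the row length of the full array; `gather_2d_ref` and `scatter_2d_ref` are hypothetical names and involve no DMA.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies a local_y x local_x sub-block, whose first element is `start`,
   out of a row-major array with rows of global_x floats into a
   contiguous buffer dst. */
void gather_2d_ref(float *dst, const float *start,
                   size_t local_x, size_t local_y, size_t global_x)
{
    assert(dst != NULL && start != NULL && local_x <= global_x);
    for (size_t row = 0; row < local_y; row++)
        memcpy(&dst[row * local_x], &start[row * global_x],
               local_x * sizeof(float));
}

/* Inverse operation: writes the contiguous buffer src back into the
   sub-block of the larger array. */
void scatter_2d_ref(const float *src, float *start,
                    size_t local_x, size_t local_y, size_t global_x)
{
    assert(src != NULL && start != NULL && local_x <= global_x);
    for (size_t row = 0; row < local_y; row++)
        memcpy(&start[row * global_x], &src[row * local_x],
               local_x * sizeof(float));
}
```

Gathering a 2x2 block that starts at row 1, column 2 of a 4x4 array holding 0 to 15 yields 6, 7, 10, 11; scattering writes the four values back to the same positions and leaves the rest of the array untouched.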

63 CellSs: Programming examples
Strided memory access

    #include
    dmal_h_t *css_gather_3d (void *ls, unsigned int start, int local_x, int local_y, int local_z,
                             int global_x, int global_z, size_t e_size, dmal_h_t *c_list);
    dmal_h_t *css_scatter_3d (void *ls, unsigned int start, int local_x, int local_y, int local_z,
                              int global_x, int global_z, size_t e_size, dmal_h_t *c_list);

Example:

    #pragma css task input (A_p) output (A[16*16])
    void example (float *A, unsigned int A_p)
    {
        dmal_h_t *entry = css_gather_1d (A, A_p, 4, 16, 64, sizeof(float), NULL);
        css_sync(entry->tag);
    }

64 CellSs: Programming examples
Another Cholesky
Original matrix A stored in consecutive positions in memory, by rows.
[Figure: matrix A of N x N elements, N = NB x B, split into NB x NB blocks of B x B elements]

    void sequential_cholesky(void)
    {
        int STEP;
        int bm;
        for (STEP = 0; STEP <= STEPS-1; STEP++)
        {
            // Update and factorize the current diagonal block
            my_cholesky_ssyrk (STEP, NB, N, A);
            my_cholesky_spotf2(STEP, NB, N, A);
            if (STEP < STEPS-1)
            {
                // Compute the current block column
                for (bm = 0; bm < STEPS-STEP-1; bm++)
                {
                    my_cholesky_sgemm(STEP, bm, NB, N, A);
                    my_cholesky_strsm(STEP, bm, NB, N, A);
                }
            }
        }
    }

    void my_cholesky_ssyrk(int STEP, int nb, int N, float *A)
    {
        for (int i = 0; i < STEP; i++) // rank update for A[d][d]
        {
            ssyrk(A[STEP*B][i*B], A[STEP*B][STEP*B]);
        }
    }

65 CellSs: Programming examples
[Figure: matrix A of NB x NB blocks of B x B elements and its shadow copy ShA; the block at block-row STEP, block-column i is highlighted]

    void sequential_cholesky(void)
    {
        int STEP;
        int bm;
        for (STEP = 0; STEP <= STEPS-1; STEP++)
        {
            // Update and factorize the current diagonal block
            my_cholesky_ssyrk (STEP, NB, N, A);
            my_cholesky_spotf2(STEP, NB, N, A);
            if (STEP < STEPS-1)
            {
                // Compute the current block column
                for (bm = 0; bm < STEPS-STEP-1; bm++)
                {
                    my_cholesky_sgemm(STEP, bm, NB, N, A);
                    my_cholesky_strsm(STEP, bm, NB, N, A);
                }
            }
        }
    }

    void my_cholesky_ssyrk(int STEP, int nb, int N, float *A)
    {
        for (int i = 0; i < STEP; i++) // rank update for A[d][d]
        {
            check_data_av(A, ShA, STEP, i, N, nb, B);
            check_data_av(A, ShA, STEP, STEP, N, nb, B);
            ssyrk (ShA[STEP*nb+i], ShA[STEP*nb+STEP]);
        }
    }

66 CellSs: Programming examples

    void check_data_av(float* M, float** shadow, int i, int j, int N, int nb, int B)
    {
        int pp;
        if (shadow[i*nb+j]==NULL) {
            shadow[i*nb+j] = (float* )malloc(B*B*sizeof(float));
            pp = (int)&M[i*nb*B+j*B];
            copy_to_shadow_block (&M[i*nb*B+j*B], pp, B, N, shadow[i*nb+j]);
        }
    }

    void copy_back_to_matrix(float* M, float** shadow, int N, int nb, int B)
    {
        int i, j, pp;
        for (i = 0; i < nb; i++) {
            for (j = 0; j < nb; j++) {
                if (shadow[i*nb+j]!=NULL) {
                    pp = (int)&M[i*nb*B+j*B];
                    copy_from_shadow_block (&M[i*nb*B+j*B], pp, nb, N, shadow[i*nb+j]);
                }
            }
        }
    }

    #pragma css task input (address[1], main_address, b, n) output (WA[64][64])
    void copy_to_shadow_block (float *address, int main_address, int b, int n, float *WA)
    #pragma css task input (WA[64][64], main_address, b, n) inout (address[1])
    void copy_from_shadow_block (float * address, int main_address, int b, int n, float *WA)
    #pragma css task inout(A[64][64]) highpriority
    void spotrf_tile(float *A)
    #pragma css task input (A[64][64]) inout(B[64][64])
    void ssyrk_tile(float *A, float *B)
    #pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
    void sgemm_tile(float *A, float *B, float *C)
    #pragma css task input (T[64][64]) inout(B[64][64])
    void strsm_tile(float *T, float *B)

Callout on the slide: the shadow blocks could be managed as a cache!

67 CellSs: Programming examples

#pragma css task input (address[1], main_address, b, n) output (WA[64][64])
void copy_to_shadow_block (float *address, int main_address, int b, int n, float *WA)
{
    // address is a trick to ensure dependencies:
    // it points to the first element of the block, as a representation
    // of the whole block
    dmal_h_t *entry;

    entry = css_gather_2d (WA, main_address, b, b, n, sizeof(float), NULL);
    ls_sync(entry->tag);
}

#pragma css task input (WA[64][64], main_address, b, n) inout (address[1])
void copy_from_shadow_block (float * address, int main_address, int b, int n, float *WA)
{
    dmal_h_t *entry;

    // as address is inout, when the task finishes it copies back its local value
    // to the original position in main memory, so we need to assign the correct
    // value to that local variable.
    address[0]=WA[0];

    entry = css_scatter_2d (WA, main_address, b, b, n, sizeof(float), NULL);
    ls_sync(entry->tag);
}
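
The 2D gather and scatter calls in the two tasks above move a b x b block between a row-major matrix with n floats per row and a contiguous buffer. The following is a minimal host-side sketch of that data movement in portable C, assuming this reading of the arguments is right; `gather_block` and `scatter_block` are hypothetical names introduced here for illustration, use `memcpy` instead of DMA, and are not part of the CellSs API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a CellSs call): copy a b x b block out of a
 * row-major matrix with n floats per row into a contiguous b*b buffer.
 * src points at the first element of the block inside the matrix. */
static void gather_block(float *dst, const float *src, size_t b, size_t n)
{
    assert(b <= n);
    for (size_t row = 0; row < b; row++)
        memcpy(dst + row * b, src + row * n, b * sizeof(float));
}

/* Hypothetical inverse: copy a contiguous b*b buffer back into its
 * position inside the row-major matrix. dst points at the first
 * element of the block inside the matrix. */
static void scatter_block(float *dst, const float *src, size_t b, size_t n)
{
    assert(b <= n);
    for (size_t row = 0; row < b; row++)
        memcpy(dst + row * n, src + row * b, b * sizeof(float));
}
```

Each row of the block is contiguous in both layouts, so one copy per row is enough; only the stride between rows differs (n in the matrix, b in the shadow block).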

68 CellSs: Performance Analysis with Paraver
Paraver
  Flexible performance visualization and analysis tool that can be used to analyze:
    MPI, OpenMP, MPI+OpenMP
    Java
    Hardware counters profile
    Operating system activity
    ... and many other things you may think of
  Generally it uses external trace file generators. Example for MPI:
    > mpitrace mpirun -n 10 my_mpi-binary
  For CellSs, the libraries have been instrumented.
    When installing the distribution, two libraries are generated: normal and instrumented
    The -t flag links with the instrumented version
  Available for free from the BSC website: www.bsc.es/paraver

69 CellSs: Performance Analysis with Paraver
Running Paraver:
  paraver tracefile-0001.prv

70 CellSs: Performance Analysis with Paraver: Configuration files

71 CellSs: Performance Analysis with Paraver: Configuration files

72 CellSs: Performance Analysis with Paraver
Clustering
(Trace view annotations: group of 8 tasks (23 µs); block size: 64x64 floats; DMA in/out; data re-use; main thread; helper thread)

73 CellSs: Performance Analysis with Paraver
Demo
  matmul application
    first view, explain what is seen
    show cfgs, explain that they are in the distribution and where: $(CELLSS_HOME)/share/cellss/paraver_cfgs/
  matmul
    show execution phases, tasks (and task type), task number
    show flushing
    size of DMA in
  another cholesky
    execution phases, tasks, task number, order of tasks that copy data from memory

74 CellSs: Performance Analysis with Paraver
Demo: Argon
Effect of the CellSs configuration file
  trace 1: the initial tracefile with default settings showed a large overhead around the barrier, due to the task generation phase before re-scheduling again
  trace 2: original application with scheduler.min_tasks = 1
  trace 3: with scheduler.min_tasks = 2. In both cases a small number of tasks are scheduled together, and the effect of the barrier is reduced
  trace 2: with scheduler.min_tasks = 4. More tasks are scheduled together, but the load is unbalanced
  trace 1: the for loops that call the velocity task are exchanged (ready tasks are generated more quickly), with scheduler.min_tasks = 1
Check_LU

75 CellSs: Performance Analysis with Paraver: Another Cholesky

76 CellSs: Performance evolution
Performance: matrix multiply
Versions with different task implementations
Task duration: from 2000 µs (simple C scalar code) to 22 µs (highly hand-vectorized/optimized code)
(Plot series: April 2007, July 2007, November 2007)
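
To put the two task durations in perspective, they can be converted into a per-task floating-point rate. The sketch below assumes each task multiplies two 64 x 64 single-precision blocks (the block size used elsewhere in this tutorial; the slide itself does not state it) and counts 2*bs^3 operations per block multiply-accumulate. `block_gemm_gflops` is a hypothetical helper name introduced here.

```c
#include <assert.h>

/* Hypothetical helper: GFLOP/s achieved by one task that performs a
 * bs x bs block multiply-accumulate (2*bs^3 floating-point operations)
 * in `micros` microseconds. */
static double block_gemm_gflops(int bs, double micros)
{
    assert(bs > 0 && micros > 0.0);
    double flops = 2.0 * (double)bs * (double)bs * (double)bs;
    return flops / (micros * 1e-6) / 1e9;
}
```

Under these assumptions, `block_gemm_gflops(64, 2000.0)` is about 0.26 GFLOP/s and `block_gemm_gflops(64, 22.0)` is about 23.8 GFLOP/s. For comparison, the commonly quoted single-precision peak of one SPE at 3.2 GHz is 25.6 GFLOP/s, so the optimized task would be running close to peak if the duration covers only the computation.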

77 CellSs: Performance evolution
Performance: Cholesky factorization
(Plot series: April 2007, July 2007, November 2007)

78 CellSs: Performance evolution
Task dependence graph for a 320 x 320 floats matrix (blocks of 64 x 64)
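
A 320 x 320 matrix split into 64 x 64 blocks gives a 5 x 5 grid of blocks. The sketch below estimates how many compute tasks of each kind such a graph contains by replaying the loop structure of the Cholesky driver shown earlier. It assumes that the sgemm wrapper, whose body is not shown in the slides, issues one sgemm task per previous step, as in a standard tiled Cholesky; the copy tasks for the shadow blocks are not counted. `count_cholesky_tasks` and `struct tile_task_counts` are hypothetical names.

```c
#include <assert.h>

/* Number of tasks of each kind generated for an nb x nb grid of blocks. */
struct tile_task_counts {
    long potrf;  /* diagonal block factorizations */
    long trsm;   /* triangular solves on the block column */
    long syrk;   /* rank updates of the diagonal block */
    long gemm;   /* updates of the off-diagonal blocks (assumed count) */
};

/* Hypothetical helper: replay the driver loops and count the tasks. */
static struct tile_task_counts count_cholesky_tasks(int nb)
{
    assert(nb >= 0);
    struct tile_task_counts c = {0, 0, 0, 0};
    for (int step = 0; step < nb; step++) {
        c.syrk += step;   /* one update per block to the left of the diagonal */
        c.potrf += 1;     /* factorize the diagonal block */
        for (int bm = 0; bm < nb - step - 1; bm++) {
            c.gemm += step;   /* assumption: one sgemm per previous step */
            c.trsm += 1;      /* one solve per block below the diagonal */
        }
    }
    return c;
}
```

For nb = 5 this gives 5 potrf, 10 trsm, 10 syrk and 10 gemm tasks, 35 compute tasks in total, which matches the closed forms nb, nb(nb-1)/2, nb(nb-1)/2 and nb(nb-1)(nb-2)/6.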

79 SMPSs performance evolution & scalability
(Plots: Cholesky, Matrix multiply, Strassen matrix multiply)
(*) Performance on SGI Altix 4700 ccNUMA machine

80 CellSs performance evolution & scalability
(Plots: Matrix multiply, Cholesky)

81 CellSs performance evolution & scalability

82 CellSs: issues and ongoing efforts
CellSs programming model
  Array regions, subobject accesses
  Blocks larger than Local Store
  Access to global memory by tasks
CellSs runtime system
  Further optimization of overheads (insert task and remove task)
  By-passing (SPE to SPE transfers)
  Scheduling algorithms: overhead, locality
  Lazy renaming
Other members of the family: SMPSs, GPUSs, hierarchical (SMPSs + CellSs)
Convergence with OpenMP 3.0

83 Conclusions
The road for new chips with multi and many cores is open
New programming models that can deal with the complexity of the hardware are now more needed than ever
StarSs
  Simple
  Portable
  Enough performance
  Enabled for different architectures: CellSs, SMPSs, GPUSs

84 CellSs and SMPSs websites
CellSs: www.bsc.es/cellsuperscalar
SMPSs: www.bsc.es/smpsuperscalar
Both available for download (open source, GPL and LGPL)

