Programming with CellSs BSC. Programming with CellSs Motivation * Source: Platform 2015: Intel® Processor and Platform Evolution for the Next Decade,


1 Programming with CellSs BSC

2 Programming with CellSs Motivation * Source: Platform 2015: Intel® Processor and Platform Evolution for the Next Decade, Intel White Paper

3 Programming with CellSs Outline CellSs StarSs Programming Model CellSs syntax CellSs compiler CellSs runtime Installing CellSs Programming examples Compiling and running a CellSs application Performance analysis using Paraver SMPSs Conclusions

4 Programming with CellSs Cell/B.E. Architecture Users' point of view So, what is the Cell/B.E.? Architecture point of view SPE PPE SPE Separate address spaces Tiny local memory Bandwidth Thin processor SMT Hard to optimize Programmers' point of view

5 Programming with CellSs StarSs programming model Basic idea
Sequential application:
...
for (i=0; i<N; i++) {
    T1 (data1, data2);
    T2 (data4, data5);
    T3 (data2, data5, data6);
    T4 (data7, data8);
    T5 (data6, data8, data9);
}
...
Task selection + parameters direction (input, output, inout)
Task graph creation based on data precedence: T1_0 T2_0 T3_0 T4_0 T5_0 T1_1 T2_1 T3_1 T4_1 T5_1 T1_2 …
Scheduling, data transfer, task execution
Synchronization, results transfer
Parallel resources (multicore, SMP, cluster, grid): Resource 1, Resource 2, Resource 3, ..., Resource N
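The task graph step above can be illustrated with a small sketch. It is not the CellSs runtime code: the types `param_t` and `task_t`, the limit `MAX_PARAMS` and the function `depends_on` are hypothetical names, and the sketch assumes that a true dependence exists whenever a later task reads a datum that an earlier task writes.

```c
#include <assert.h>
#include <stddef.h>

/* Direction of a task parameter, mirroring the input/output/inout clauses. */
typedef enum { DIR_INPUT, DIR_OUTPUT, DIR_INOUT } direction_t;

/* One parameter of a task instance: which datum it touches and how. */
typedef struct {
    const void *data;
    direction_t dir;
} param_t;

#define MAX_PARAMS 4   /* hypothetical limit, for illustration only */

typedef struct {
    param_t params[MAX_PARAMS];
    size_t nparams;
} task_t;

static int writes(direction_t d) { return d != DIR_INPUT; }
static int reads(direction_t d)  { return d != DIR_OUTPUT; }

/* Returns 1 if `later` has a true (read-after-write) dependence on `earlier`,
   that is, some datum written by `earlier` is read by `later`. */
int depends_on(const task_t *later, const task_t *earlier)
{
    assert(later->nparams <= MAX_PARAMS && earlier->nparams <= MAX_PARAMS);
    for (size_t i = 0; i < earlier->nparams; i++) {
        if (!writes(earlier->params[i].dir))
            continue;
        for (size_t j = 0; j < later->nparams; j++) {
            if (later->params[j].data == earlier->params[i].data &&
                reads(later->params[j].dir))
                return 1;
        }
    }
    return 0;
}
```

If T1 is assumed to write data2 and T2 to write data5, `depends_on` reports that T3 depends on both, while T2 does not depend on T1. An edge in the task graph would be added for each such pair.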

6 Programming with CellSs StarSs programming model GRIDSs, COMPSs Tailored for Grids or clusters Data dependency analysis based on files C/C++, Java SMPSs Tailored for SMPs or homogeneous multicores C or Fortran CellSs Tailored for Cell/B.E. processor C or Fortran

7 Programming with CellSs CellSs: Syntax example - matrix multiply
int main (int argc, char **argv) {
    int i, j, k;
    …
    initialize(A, B, C);
    for (i=0; i < NB; i++)
        for (j=0; j < NB; j++)
            for (k=0; k < NB; k++)
                block_addmultiply( C[i][j], A[i][k], B[k][j]);
}

static void block_addmultiply( float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
    int i, j, k;
    for (i=0; i < BS; i++)
        for (j=0; j < BS; j++)
            for (k=0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}
(Figure labels: NB blocks per dimension, B elements per block dimension.)

8 Programming with CellSs CellSs: Syntax example - matrix multiply
int main (int argc, char **argv) {
    int i, j, k;
    …
    initialize(A, B, C);
    for (i=0; i < NB; i++)
        for (j=0; j < NB; j++)
            for (k=0; k < NB; k++)
                block_addmultiply( C[i][j], A[i][k], B[k][j]);
}

#pragma css task input(A, B) inout(C)
static void block_addmultiply( float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
    int i, j, k;
    for (i=0; i < BS; i++)
        for (j=0; j < BS; j++)
            for (k=0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}
(Figure labels: NB blocks per dimension, B elements per block dimension.)

9 Programming with CellSs CellSs: Syntax
Pragmas' syntax:
#pragma css task [input (...)] \
                 [output (...)] \
                 [inout (...)] \
                 [highpriority]
void task (...) { ... }
#pragma css wait on (...)
#pragma css barrier
#pragma css start
#pragma css finish

10 Programming with CellSs CellSs: Syntax Examples: task selection #pragma css task input(A, B) inout(C)‏ void block_addmultiply( float C[N][N], float A[N][N], float B[N][N] ) {... #pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS])‏ void block_addmultiply( float *C, float *A, float *B ) {.. #pragma css task input(A[BS][BS], B[BS][BS], BS) inout(C[BS][BS])‏ void block_addmultiply( float *C, float *A, float *B, int BS ) {... Examples: waiting for data #pragma css task input (ref_block, to_comp) output (mse)‏ void are_blocks_equal (float ref_block[BS][BS], float to_comp[BS][BS], float *mse) {...... are_blocks_equal (X[ii][jj],Y[ii][jj], &sq_error); #pragma css wait on (sq_error)‏ if (sq_error >0.0000001)‏

11 Programming with CellSs CellSs: Syntax
Examples: synchronization
for (i=0; i < NB; i++)
    for (j=0; j < NB; j++)
        for (k=0; k < NB; k++)
            block_addmultiply( C[i][j], A[i][k], B[k][j]);
#pragma css barrier
Examples: prioritization
#pragma css task input(lefthalo[32], tophalo[32], righthalo[32], \
                       bottomhalo[32]) inout(A[32][32]) highpriority
void jacobi (float *lefthalo, float *tophalo, float *righthalo, float *bottomhalo, float *A)
{ ... }

12 Programming with CellSs CellSs: Syntax Examples: CellSs program boundary #pragma css start for (i=0; i < NB; i++) for (j=0; j < NB; j++) for (k=0; k < NB; k++)‏ block_addmultiply( C[i][j], A[i][k], B[k][j]); #pragma css finish

13 Programming with CellSs CellSs: Syntax in Fortran
subroutine example()
  ...
  interface
    !$CSS TASK
    subroutine block_add_multiply(C, A, B, BS)
      implicit none
      integer, intent (in) :: BS
      real, intent (in) :: A(BS,BS), B(BS,BS)
      real, intent (inout) :: C(BS,BS)
    end subroutine
  end interface
  ...
  !$CSS START
  ...
  call block_add_multiply(C, A, B, BLOCK_SIZE)
  ...
  !$CSS FINISH
  ...
end subroutine

!$CSS TASK
subroutine block_add_multiply(C, A, B, BS)
  ...
end subroutine

14 Programming with CellSs CellSs compiler: Compiler phase [Figure: CELLSS-CC compiler phase. Code translation (mcc) turns app.c into cellss-spu-cc_app.c, cellss-ppu-cc_app.c and app.tasks (tasks list); the SPE compiler and the PPE compiler build the corresponding objects; pack combines them into app.o.]

15 Programming with CellSs CellSs compiler: Compiler phase Files app.c: User code, with CellSs annotations cellss-spu-cc_app.c: specific code generated for the spu (tasks code)‏ cellss-ppu-cc_app.c: specific code generated for the ppu (main program)‏ app.tasks: list of annotated tasks Compilation steps mcc: source to source compiler, based on the Mercurium compiler (BSC). SPE compiler: Generic SPE compiler (IBM SDK)‏ PPE compiler: Generic PPE compiler (IBM SDK)‏ pack: Specific CellSs module that combines objects (BSC)‏

16 Programming with CellSs CellSs compiler: Linker phase [Figure: CELLSS-CC linker phase. Tools: unpack, glue code generator, SPE compiler, SPE linker, SPE embedder, PPE compiler, PPE linker. Files: app.o, app.c, app.tasks, cellss-spu-cc_app.o, cellss-ppu-cc_app.o, app-adapters.c, app-adapters.cc, exec-adapters.c, exec-adapters.o, exec-registration.c, exec-registration.o, exec-spu, exec-spu.o, libCellSS-spu.a, libCellSS.so, exec.]

17 Programming with CellSs CellSs compiler: Linker phase Files exec-adapters.c: code generated for each of the annotated tasks to uniformly call them (“stubs”). exec-registration.c: code generated to register the annotated tasks Linker steps unpack: unpacks objects glue code generator: from all the *.tasks files of an application generates a single “adapters” file and a single “registration” file per executable SPE, PPE compilers and linkers and SPE embedder (IBM SDK)

18 Programming with CellSs CellSs: Runtime [Figure: runtime structure. PPE: user main program, CellSs PPU lib, main thread, helper thread; runtime functions: data dependence, data renaming, scheduling, work assignment, synchronization. SPE 0, SPE 1, SPE 2, ...: CellSs SPU lib, original task code; per-task steps: DMA in, task execution, DMA out, synchronization. Memory: user data, renaming table, task control buffer, tasks, finalization signal, stage in/out data.]

19 Programming with CellSs CellSs: Runtime - argument renaming False dependences (WaW and WaR) are removed with dynamic renaming of arguments for (i=0; i<N; i++) { T1 (…,…, block1); T2 (block1, …, block2[i]); T3 (block2[i],…,…); } Block1 is output from task T1 Block1 is input to task T2 block1 T1_1 T2_1 T3_1 T1_2 T2_2 T3_2 T1_N T2_N T3_N … block1 WaR WaW WaR WaW WaR WaW

20 Programming with CellSs CellSs: Runtime - argument renaming False dependences (WaW and WaR) are removed with dynamic renaming of arguments for (i=0; i<N; i++) { T1 (…,…, block1); T2 (block1, …, block2[i]); T3 (block2[i],…,…); } Block1 is output from task T1 Block1 is input to task T2 block1_N block1_2 T1_1 T2_1 T3_1 T1_2 T2_2 T3_2 T1_N T2_N T3_N … block1_1 WaR WaW WaR WaW WaR WaW
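The renaming idea can be sketched as a table that hands out a fresh instance of an object on every write. This is a simplified model under stated assumptions, not the CellSs implementation: `rename_table_t`, `rename_for_read`, `rename_for_write` and `MAX_OBJECTS` are hypothetical names, and instances are represented only by a version number instead of real storage.

```c
#include <assert.h>

#define MAX_OBJECTS 16   /* hypothetical table capacity */

/* For each program object, the number of its most recent renamed instance. */
typedef struct {
    const void *obj[MAX_OBJECTS];
    int version[MAX_OBJECTS];
    int count;
} rename_table_t;

/* Finds the table entry for obj, creating it on first use. */
static int lookup(rename_table_t *t, const void *obj)
{
    for (int i = 0; i < t->count; i++)
        if (t->obj[i] == obj)
            return i;
    assert(t->count < MAX_OBJECTS);
    t->obj[t->count] = obj;
    t->version[t->count] = 0;   /* version 0 stands for the original storage */
    return t->count++;
}

/* A task that reads obj uses the latest instance. */
int rename_for_read(rename_table_t *t, const void *obj)
{
    return t->version[lookup(t, obj)];
}

/* A task that writes obj gets a fresh instance, so it does not have to wait
   for earlier readers (WaR) or earlier writers (WaW) of older instances. */
int rename_for_write(rename_table_t *t, const void *obj)
{
    int i = lookup(t, obj);
    return ++t->version[i];
}
```

In the loop from the slide, each T1 would call `rename_for_write` on block1 and each T2 would call `rename_for_read`, so iteration 1 works on instance 1 and iteration 2 on instance 2, and the two iterations no longer conflict on block1.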

21 Programming with CellSs CellSs: Runtime - Dependence detection [Figure: renaming table with one entry per address (@), each pointing to its last renaming; object instances holding Type, size, …, *obj, *producer, *prev and # users; task dependence graph.]

22 Programming with CellSs CellSs: Runtime – scheduling Scheduling strategy Critical path Locality Bundle of dependent tasks: data locality in SPE Bundle of independent tasks: Mixed bundle

23 Programming with CellSs CellSs: Runtime – scheduling Scheduling for locality Ready lists (R_i). A higher subindex indicates higher priority according to memory locality. Scheduling selects tasks from high-priority ready lists (higher “i”). Example (figure): tasks u and v move between ready lists R_i and R_i+1, with ReadyLocs(u) = {@A, @B}, ReadyLocs(v) = {@C, @D}, LocHints(SPE_j) = {@X, @Y, @B, @Z}, LocHints(SPE_j+1) = {@U, @V, @W}.
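One plausible way to turn ReadyLocs and LocHints into a priority is to count the addresses they share. The slide does not state the exact metric, so the following is only a sketch under that assumption; `locality_score` is a hypothetical helper, not a CellSs function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts how many addresses in a task's ReadyLocs set also appear in the
   LocHints set of an SPE. A higher count means more data already local. */
size_t locality_score(const uintptr_t *ready_locs, size_t n_ready,
                      const uintptr_t *loc_hints, size_t n_hints)
{
    size_t hits = 0;

    assert(ready_locs != NULL || n_ready == 0);
    assert(loc_hints != NULL || n_hints == 0);
    for (size_t i = 0; i < n_ready; i++) {
        for (size_t j = 0; j < n_hints; j++) {
            if (ready_locs[i] == loc_hints[j]) {
                hits++;
                break;   /* count each ready location at most once */
            }
        }
    }
    return hits;
}
```

With the sets from the slide, task u shares @B with SPE j and scores 1, while task v shares nothing with SPE j and scores 0, which would place u in a higher-priority ready list for that SPE.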

24 Programming with CellSs CellSs: Runtime – scheduling “Co-parent” edges Co-parent edges are added between tasks that share a direct descendent Odep(u), number of outstanding dependences of task u outside the current bundle

25 Programming with CellSs CellSs: Runtime – scheduling “Co-parent” edges Co-parent edges are added between tasks that share a direct descendent Maximum of two co-parent edges (due to implementation costs)‏

26 Programming with CellSs CellSs: Runtime – scheduling Scheduling algorithm
R_i: ready lists
Btemp: candidates for being integrated in a bundle
B: bundle to be scheduled
while not ScheduleStop {
    t = head (R_M) | M = max{i | 0 < i < N} and R_i not empty
    add_to_head (t, Btemp);
    while DepthSearch {
        u = head (Btemp);
        if Odep(u)==0 {
            add_to_tail (u, B);
            if ((b = CoParent (u)) != 0) add_to_head (b, Btemp);
            else if ((s = successor (u)) != 0) add_to_tail (s, Btemp);
        }
    }
}

27 Programming with CellSs CellSs: Runtime – scheduling 1 5 6 8 12 15 16 17 2 7 9 34 13 10 14 11 Imagine R1 = {1} and R0 = {2, 3, 4, 7, 9, 10, 11} External loop t = 1; Btemp = {1} internal loop: iteration 1 u = 1; Btemp = { }; B = {u} b = 2; Btemp = {2};

28 Programming with CellSs CellSs: Runtime – scheduling 1 5 6 8 12 15 16 17 2 7 9 34 13 10 14 11 internal loop: iteration 2 u = 2; Btemp = { }; B = {1,2} b = 3; Btemp = {3}; internal loop: iteration 3 u = 3; Btemp = { }; B = {1,2,3} b = 4; Btemp = {4}; internal loop: iteration 4 u = 4; Btemp = { }; B = {1,2,3,4} s = 5; Btemp = {5};

29 Programming with CellSs CellSs: Runtime – scheduling 1 5 6 8 12 15 16 17 2 7 9 34 13 10 14 11 internal loop: iteration 5 u = 5; Btemp = { }; B = {1,2,3,4,5} s = 6; Btemp = {6}; internal loop: iteration 6 u = 6; Btemp = { }; B = {1,2,3,4,5,6} b = 7; Btemp = {7}; internal loop: iteration 7 u = 7; Btemp = { }; B = {1,2,3,4,5,6,7} s = 8; Btemp = {8};

30 Programming with CellSs CellSs: Runtime – scheduling 1 5 6 8 12 15 16 17 2 7 9 34 13 10 14 11 internal loop: iteration 8 u = 8; Btemp = { }; B = {1,2,3,4,5,6,7,8} b = 9; Btemp = {9}; internal loop: ends since maximum size of bundle is reached

31 Programming with CellSs CellSs: Runtime Paraver view of the runtime behavior Bundle Main thread: runs user code, adds tasks to the task graph and removes them from it SPEs: execute tasks' code Helper thread: schedules tasks and synchronizes with SPEs

32 Programming with CellSs CellSs: Runtime – specific SPE library features Data dependence analysis, data renaming, task scheduling performed in the CellSs PPE runtime library CellSs SPE runtime library implements specific features, to assist the CellSs PPE runtime library, but independently Early callback Minimal stage-out Software cache in the SPE Local Store Double buffering

33 Programming with CellSs CellSs: Runtime – specific SPE library features Early callback Initially, completion of tasks is communicated on a per-bundle basis. There are cases where this limits the application (task A in the example). An early callback after the limiting task enables the scheduling of new bundles. Condition: the task has more than one outgoing dependency.

34 Programming with CellSs CellSs: Runtime – specific SPE library features Minimal stage-out For each task in a bundle, its outputs will be written back to main memory. If a later task inside the bundle rewrites the same output, there is no need to write the earlier value back to main memory. The case in the figure cannot happen, thanks to renaming. Example: matmul C[i][j] += A[i][k]*B[k][j] [Figure: bundles of tasks X, Y, X, Z, annotated with "writes A", "reads A" and, in the renamed case, "writes A'".]

35 Programming with CellSs CellSs: Runtime – specific SPE library features ... #pragma css task input(A, B) inout(C) block_addmultiply( C[i][j], A[i][k], B[k][j]) C[i][j] A[i][k] B[k][j] For each operation, two blocks of data are fetched from PPE memory into the SPE local store. Clusters of dependent tasks are scheduled to the same SPE. The inout block is kept in the local store and written to PPE memory only once (reuse).

36 Programming with CellSs CellSs: Runtime – specific SPE library features Software cache in the SPE Local Store Maintained by the SPE runtime LRU replacement strategy PPE scheduling is not aware of this behavior
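An LRU replacement policy of the kind mentioned above can be sketched as follows. This is a generic illustration, not the CellSs SPE runtime: `lru_cache_t`, `cache_init`, `cache_access` and the slot count `CACHE_SLOTS` are hypothetical, and the cache only tracks which main-memory addresses are resident, not the data itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 4   /* hypothetical number of blocks that fit in the Local Store */

typedef struct {
    uint32_t ea[CACHE_SLOTS];             /* main-memory address held by each slot */
    unsigned long last_use[CACHE_SLOTS];  /* logical time of the last access */
    int valid[CACHE_SLOTS];
    unsigned long clock;
} lru_cache_t;

void cache_init(lru_cache_t *c)
{
    assert(c != NULL);
    for (int i = 0; i < CACHE_SLOTS; i++)
        c->valid[i] = 0;
    c->clock = 0;
}

/* Returns 1 on a hit and 0 on a miss. On a miss the block is placed in an
   empty slot if one exists, otherwise in the least recently used slot. */
int cache_access(lru_cache_t *c, uint32_t ea)
{
    int victim = 0;

    assert(c != NULL);
    c->clock++;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (c->valid[i] && c->ea[i] == ea) {
            c->last_use[i] = c->clock;
            return 1;
        }
    }
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (!c->valid[i]) { victim = i; break; }
        if (c->last_use[i] < c->last_use[victim]) victim = i;
    }
    c->ea[victim] = ea;
    c->valid[victim] = 1;
    c->last_use[victim] = c->clock;
    return 0;
}
```

A hit corresponds to a block that does not need a new DMA transfer; a miss corresponds to a transfer that may evict the block that has gone unused the longest.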

37 Programming with CellSs CellSs: Runtime - specific SPE library features Double buffering CellSs overlaps DMA transfers with computations. Per-SPE timeline for a bundle (Task 1 in bundle, Task 2 in bundle, ..., Task N in bundle): DMA programming: reading task control buffer; waiting for DMA transfer; DMA programming: reading data; task execution overlapped with data transfers; DMA programming: writing data; synchronization with helper thread.
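The buffer rotation behind double buffering can be shown with a sequential model. In this sketch `memcpy` stands in for the asynchronous DMA transfers, so nothing actually overlaps; the point is only which buffer is being filled while the other is being consumed. The function `sum_blocks_double_buffered` and the bound `MAX_BS` are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_BS 64   /* hypothetical upper bound on elements per block */

/* Sums nblocks blocks of bs floats stored contiguously in main_mem, using two
   alternating buffers. memcpy models the DMA "get" of each block. */
float sum_blocks_double_buffered(const float *main_mem, int nblocks, int bs)
{
    float buf[2][MAX_BS];
    float total = 0.0f;

    assert(nblocks > 0 && bs > 0 && bs <= MAX_BS);
    memcpy(buf[0], main_mem, (size_t)bs * sizeof(float));   /* stage in block 0 */
    for (int t = 0; t < nblocks; t++) {
        int cur = t % 2;
        int next = 1 - cur;
        if (t + 1 < nblocks)   /* on real hardware: start the DMA for the next block */
            memcpy(buf[next], main_mem + (size_t)(t + 1) * bs,
                   (size_t)bs * sizeof(float));
        for (int i = 0; i < bs; i++)   /* "task execution" on the current buffer */
            total += buf[cur][i];
    }
    return total;
}
```

On the SPE, the copy into `buf[next]` would be issued before the computation on `buf[cur]` starts and waited for only afterwards, which is where the overlap comes from.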

38 Programming with CellSs CellSs: Runtime - specific SPE library features Double buffering: paraver view SPE reads data SPE executes task

39 Programming with CellSs CellSs: Runtime - specific SPE library features Double buffering: paraver view DMA programming SPE waits for DMA in

40 Programming with CellSs CellSs: Runtime - specific SPE library features Double buffering: paraver view DMA out programming DMA in programming SPE waits for DMA in

41 Programming with CellSs CellSs: Runtime - specific SPE library features Double buffering: paraver view DMA out programming SPE waits for DMA out (all)‏

42 Programming with CellSs CellSs: Installing CellSs
Download code: www.bsc.es/cellsuperscalar -> download
gunzip, tar
Installation instructions are in the CellSs manual: www.bsc.es/cellsuperscalar -> documents
Run the configure script with the installation directory as prefix
./configure --prefix=/opt/CellSS
Other options can be specified
./configure --help
make
make install

43 Programming with CellSs CellSs: Programming examples Cholesky factorization Common matrix operation used to solve normal equations in linear least squares problems. Calculates a triangular matrix (L) from a symmetric positive definite matrix A: Cholesky(A) = L, where L · L^T = A. Different implementations are possible, depending on how the matrix is traversed (by rows, by columns, left-looking, right-looking). It can be decomposed into block operations.

44 Programming with CellSs CellSs: Programming examples In each iteration red and blue blocks are updated SPOTF: Compute the Cholesky factorization of the diagonal block. STRSM: Compute the column panel SSYRK: Update the rest of the matrix

45 Programming with CellSs CellSs: Programming examples Cholesky factorization
main (){
    ...
    for (i = 0; i < DIM; i++) {
        for (j = 0; j < i; j++) {
            for (k = 0; k < j; k++) {
                sgemm_tile( A[i][k], A[j][k], A[i][j] );
            }
            strsm_tile( A[j][j], A[i][j] );
        }
        for (j = 0; j < i; j++) {
            ssyrk_tile( A[i][j], A[i][i] );
        }
        spotrf_tile( A[i][i] );
    }
    ...
    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < DIM; j++) {
            #pragma css wait on (A[i][j])
            print_block(A[i][j]);
        }
    }
    ...
}

#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
void sgemm_tile(float *A, float *B, float *C)

#pragma css task input (T[64][64]) inout(B[64][64])
void strsm_tile(float *T, float *B)

#pragma css task input(A[64][64]) inout(C[64][64])
void ssyrk_tile(float *A, float *C)

#pragma css task inout(A[64][64])
void spotrf_tile(float *A)
(Figure labels: DIM blocks per dimension, 64 elements per block dimension.)

46 Programming with CellSs CellSs: Programming examples Sparse LU A more generic factorization than Cholesky. Deals with non-symmetric matrices. Calculates one lower triangular matrix (L) and one upper triangular matrix (U) whose product equals a row permutation of the original matrix: Perm(A) = L*U. Difficult to program for Cell, since some operations work on columns (not blocks). The example shown here is a simplified version (without pivoting) based on an initial sparse matrix.
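For reference, LU without pivoting on a small dense matrix can be written in a few lines. This is a plain element-level sketch, not the blocked sparse code from the slides; `lu_nopivot` is a hypothetical name, and the matrix is assumed to be stored row-major.

```c
#include <assert.h>
#include <stddef.h>

/* In-place LU factorization without pivoting of an n x n row-major matrix.
   On return the strict lower triangle holds L (unit diagonal implied) and the
   upper triangle, including the diagonal, holds U. Returns -1 on a zero pivot. */
int lu_nopivot(int n, double *a)
{
    assert(n > 0 && a != NULL);
    for (int k = 0; k < n; k++) {
        if (a[k * n + k] == 0.0)
            return -1;                       /* would need pivoting */
        for (int i = k + 1; i < n; i++) {
            a[i * n + k] /= a[k * n + k];    /* multiplier, an entry of L */
            for (int j = k + 1; j < n; j++)  /* update the trailing submatrix */
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
        }
    }
    return 0;
}
```

For A = [[4, 3], [6, 3]] this leaves L = [[1, 0], [1.5, 1]] and U = [[4, 3], [0, -1.5]] packed in the same array. A blocked version applies the same three steps (factor the pivot, scale the column, update the rest) to whole blocks instead of single elements.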

47 Programming with CellSs CellSs: Programming examples int main(int argc, char **argv) { int ii, jj, kk; … for (kk=0; kk<NB; kk++) { lu0(A[kk][kk]); for (jj=kk+1; jj<NB; jj++)‏ if (A[kk][jj] != NULL) fwd(A[kk][kk], A[kk][jj]); for (ii=kk+1; ii<NB; ii++) if (A[ii][kk] != NULL) { bdiv (A[kk][kk], A[ii][kk]); for (jj=kk+1; jj<NB; jj++) if (A[kk][jj] != NULL) { if (A[ii][jj]==NULL)‏ A[ii][jj]=allocate_clean_block(); bmod(A[ii][kk], A[kk][jj], A[ii][jj]); } B B NB B B void lu0(float *diag); void bdiv(float *diag, float *row); void bmod(float *row, float *col, float *inner); void fwd(float *diag, float *col); Sparse LU

48 Programming with CellSs Dynamic main memory allocation Data dependent parallelism int main(int argc, char **argv) { int ii, jj, kk; … for (kk=0; kk<NB; kk++) { lu0(A[kk][kk]); for (jj=kk+1; jj<NB; jj++)‏ if (A[kk][jj] != NULL) fwd(A[kk][kk], A[kk][jj]); for (ii=kk+1; ii<NB; ii++) if (A[ii][kk] != NULL) { bdiv (A[kk][kk], A[ii][kk]); for (jj=kk+1; jj<NB; jj++) if (A[kk][jj] != NULL) { if (A[ii][jj]==NULL)‏ A[ii][jj]=allocate_clean_block(); bmod(A[ii][kk], A[kk][jj], A[ii][jj]); } CellSs: Programming examples #pragma css task inout(diag[B][B]) highpriority void lu0(float *diag); #pragma css task input(diag[B][B]) inout(row[B][B])‏ void bdiv(float *diag, float *row); #pragma css task input(row[B][B],col[B][B]) inout(inner[B][B])‏ void bmod(float *row, float *col, float *inner); #pragma css task input(diag[B][B]) inout(col[B][B])‏ void fwd(float *diag, float *col);

49 Programming with CellSs CellSs: Programming examples SPU memory functionality: tailored CellSs API to deal with memory issues in the SPU Dynamic memory allocation Local Storage (LS) space in each SPU is limited, so CellSs tries to control as much of it as possible #include void *css_malloc (unsigned int size); void css_free (void *chunk);

50 Programming with CellSs CellSs: Programming examples Example: Dynamic memory allocation
#pragma css task input(bs, log2_N, is_forward, twiddle) inout(data, sync)
static void FFT1D_1 (int bs, int log2_N, float twiddle[CUBE_SIZE*2], int is_forward,
                     float data[bs][2*CUBE_SIZE], int sync[1])
{
    FFT1D_core (bs, data, log2_N, twiddle, is_forward);
}

static void FFT1D_core (int bs, float data[bs][2*CUBE_SIZE], int log2_N,
                        float twiddle[CUBE_SIZE*2], int is_forward)
{
    int i;
    int n_floats_elems = (1 << log2_N)*2;
    float *work_re = css_malloc(sizeof(float)*n_floats_elems);
    float *work_im = css_malloc(sizeof(float)*n_floats_elems);
    for (i=0; i<bs; i++)
        spe_FFT_1D_core (log2_N, &data[i][0], twiddle, is_forward, work_re, work_im);
    css_free((void *)work_re);
    css_free((void *)work_im);
}

51 Programming with CellSs CellSs: Programming examples DMA accesses
CellSs handles all data transfers from main memory to the SPU Local Store
Some applications may need to do explicit data transfers from main memory
For transfers of 1, 2, 4, 8 bytes or multiples of 16 bytes up to 16 KB:
#include
void css_get_a (void *ls, uint32_t ea, unsigned int dma_size, tagid_t tag);
void css_put_a (void *ls, uint32_t ea, unsigned int dma_size, tagid_t tag);
ls: pointer to a 16-byte aligned allocated buffer in LS
ea: pointer to main memory
dma_size: size of the block
tag: identifier of the DMA transfer
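The size rule quoted above (1, 2, 4 or 8 bytes, or a multiple of 16 bytes up to 16 KB) can be checked before choosing between the aligned and the general transfer calls. The helpers below are hypothetical and are not part of the CellSs API; they only encode the rule as stated on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if size is 1, 2, 4 or 8 bytes, or a multiple of 16 bytes no
   larger than 16 KB; returns 0 otherwise. */
int dma_size_is_valid(size_t size)
{
    if (size == 1 || size == 2 || size == 4 || size == 8)
        return 1;
    return size >= 16 && size % 16 == 0 && size <= 16 * 1024;
}

/* Example guard a caller might place before an aligned transfer: the Local
   Store buffer must be 16-byte aligned and the size must follow the rule. */
void check_aligned_transfer(const void *ls, size_t dma_size)
{
    assert(((uintptr_t)ls % 16) == 0);
    assert(dma_size_is_valid(dma_size));
}
```

A transfer of 24 bytes, for instance, fails the check because it is neither one of the small sizes nor a multiple of 16.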

52 Programming with CellSs CellSs: Programming examples DMA accesses
Obtaining a tag: returns a valid tag for a DMA transfer
tagid_t css_tag (void);
Synchronization
void css_sync (tagid_t tag);
For DMA transfers not meeting the previous requirements
void css_get (void *ls, unsigned int address, unsigned int dma_size, tagid_t tag);
void css_put (void *ls, unsigned int address, unsigned int dma_size, tagid_t tag);
Example:
float *blocks = (float *)css_malloc(N*sizeof(Complex));
tag = css_tag ();
css_get_a (blocks, addr, (unsigned int)(N*sizeof(Complex)), tag);
css_sync(tag);

53 Programming with CellSs CellSs: Programming examples Strided Memory access
Interface to scatter/gather data from 1D, 2D and 3D arrays
#include
dmal_h_t *css_gather_1d (void *ls, unsigned int start, int chunk, int stride, size_t size, size_t e_size, dmal_h_t *c_list);
dmal_h_t *css_scatter_1d (void *ls, unsigned int start, int chunk, int stride, size_t size, size_t e_size, dmal_h_t *c_list);
ls: pointer to a 16-byte aligned allocated buffer in LS
c_list: allows the same memory access pattern to be reused (reuses DMA lists)
size: number of objects to be copied
e_size: size of one element
(Figure labels: start, chunk, stride.)

54 Programming with CellSs CellSs: Programming examples Strided Memory access #include dmal_h_t *css_gather_2d (void *ls, unsigned int start, int local_x, int local_y, int global_x, size_t e_size, dmal_h_t *c_list); dmal_h_t *css_scatter_2d (void *ls, unsigned int start, int local_x, int local_y, int global_x, size_t e_size, dmal_h_t *c_list); local_x local_y global_x start
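The effect of a 2D gather can be modelled on the host with plain copies. The parameter meanings used here are assumptions read off the figure labels, not taken from the CellSs manual: `local_x` is taken as the number of elements per row of the sub-block, `local_y` as the number of rows, and `global_x` as the row length of the enclosing array. `gather_2d_model` is a hypothetical function that takes a pointer instead of a main-memory address and uses `memcpy` instead of a DMA list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies a local_x by local_y sub-block, whose first element is at start and
   which lives inside an array with rows of global_x elements, into the
   contiguous buffer ls. e_size is the size of one element in bytes. */
void gather_2d_model(void *ls, const void *start, int local_x, int local_y,
                     int global_x, size_t e_size)
{
    const unsigned char *src = start;
    unsigned char *dst = ls;

    assert(local_x > 0 && local_y > 0 && local_x <= global_x);
    for (int row = 0; row < local_y; row++) {
        memcpy(dst + (size_t)row * local_x * e_size,    /* packed destination row */
               src + (size_t)row * global_x * e_size,   /* strided source row */
               (size_t)local_x * e_size);
    }
}
```

Gathering a 2 x 2 block that starts at element (1,1) of a 4 x 4 array numbered 0 to 15 yields 5, 6, 9, 10 in the local buffer. A scatter would perform the same copies in the opposite direction.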

55 Programming with CellSs CellSs: Programming examples Strided Memory access
#include
dmal_h_t *css_gather_3d (void *ls, unsigned int start, int local_x, int local_y, int local_z, int global_x, int global_z, size_t e_size, dmal_h_t *c_list);
dmal_h_t *css_scatter_3d (void *ls, unsigned int start, int local_x, int local_y, int local_z, int global_x, int global_z, size_t e_size, dmal_h_t *c_list);
Example:
#pragma css task input (A_p) output (A[16*16])
void example (float *A, unsigned int A_p)
{
    dmal_h_t *entry = css_gather_1d (A, A_p, 4, 16, 64, sizeof(float), NULL);
    css_sync(entry->tag);
}

56 Programming with CellSs CellSs: Programming examples void sequential_cholesky(void)‏ { int STEP; int bm; for (STEP = 0; STEP <= STEPS-1; STEP++)‏ { // Update and factorize the current diagonal block my_cholesky_ssyrk (STEP, NB, N, A); my_cholesky_spotf2(STEP, NB, N, A); if (STEP < STEPS-1)‏ { // Compute the current block column for (bm = 0; bm < STEPS-STEP-1; bm++)‏ { my_cholesky_sgemm(STEP, bm, NB, N, A); my_cholesky_strsm(STEP, bm, NB, N, A); } void my_cholesky_ssyrk(int STEP, int nb, int N, float *A)‏ { for (int i = 0; i < STEP; i++) // rank update for A[d][d] { ssyrk(A[STEP*B][i*B],A[STEP*B][STEP*B]); } A Original matrix A stored in consecutive positions in memory by rows Another Cholesky N = NB x B NB x B B

57 Programming with CellSs CellSs: Programming examples void sequential_cholesky(void)‏ { int STEP; int bm; for (STEP = 0; STEP <= STEPS-1; STEP++)‏ { // Update and factorize the current diagonal block my_cholesky_ssyrk (STEP, NB, N, A); my_cholesky_spotf2(STEP, NB, N, A); if (STEP < STEPS-1)‏ { // Compute the current block column for (bm = 0; bm < STEPS-STEP-1; bm++)‏ { my_cholesky_sgemm(STEP, bm, NB, N, A); my_cholesky_strsm(STEP, bm, NB, N, A); } void my_cholesky_ssyrk(int STEP, int nb, int N, float *A)‏ { for (int i = 0; i < STEP; i++) // rank update for A[d][d] { check_data_av(A, ShA, STEP, i, N, nb, B); check_data_av(A, ShA, STEP, STEP, N, nb, B); ssyrk (ShA[STEP*nb+i], ShA[STEP*nb+STEP]); } NB B B ShA A STEP i

58 Programming with CellSs CellSs: Programming examples
void check_data_av(float* M, float** shadow, int i, int j, int N, int nb, int B)
{
    int pp;
    if (shadow[i*nb+j]==NULL) {
        shadow[i*nb+j] = (float* )malloc(B*B*sizeof(float));
        pp = (int)&M[i*N*B+j*B];
        copy_to_shadow_block (&M[i*N*B+j*B], pp, B, N, shadow[i*nb+j]);
    }
}

void copy_back_to_matrix(float* M, float** shadow, int N, int nb, int B)
{
    int i, j, pp;
    for (i = 0; i < nb; i++) {
        for (j = 0; j < nb; j++) {
            if (shadow[i*nb+j]!=NULL) {
                pp = (int)&M[i*N*B+j*B];
                copy_from_shadow_block (&M[i*N*B+j*B], pp, B, N, shadow[i*nb+j]);
            }
        }
    }
}

#pragma css task input (address[1], main_address, b, n) output (WA[64][64])
void copy_to_shadow_block (float *address, int main_address, int b, int n, float *WA)

#pragma css task input (WA[64][64], main_address, b, n) inout (address[1])
void copy_from_shadow_block (float * address, int main_address, int b, int n, float *WA)

#pragma css task inout(A[64][64]) highpriority
void spotrf_tile(float *A)

#pragma css task input (A[64][64]) inout(B[64][64])
void ssyrk_tile(float *A, float *B)

#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
void sgemm_tile(float *A, float *B, float *C)

#pragma css task input (T[64][64]) inout(B[64][64])
void strsm_tile(float *T, float *B)

Could be managed as a cache !!!

59 Programming with CellSs CellSs: Programming examples
#pragma css task input (address[1], main_address, b, n) output (WA[64][64])
void copy_to_shadow_block (float *address, int main_address, int b, int n, float *WA)
{
    // address is a trick to ensure dependencies
    // address points to the first element of the block as a representation
    // of the whole block
    dmal_h_t *entry;
    entry = css_gather_2d (WA, main_address, b, b, n, sizeof(float), NULL);
    ls_sync(entry->tag);
}

#pragma css task input (WA[64][64], main_address, b, n) inout (address[1])
void copy_from_shadow_block (float * address, int main_address, int b, int n, float *WA)
{
    dmal_h_t *entry;
    address[0]=WA[0];
    // as address is inout, when the task finishes it copies back its local value
    // to the original position in main memory, so we need to assign the correct
    // value to that local variable.
    entry = css_scatter_2d (WA, main_address, b, b, n, sizeof(float), NULL);
    ls_sync(entry->tag);
}

60 Programming with CellSs CellSs: Programming examples Checking LU
int main(int argc, char* argv[])
{
    ...
    copy_mat (A, origA);
    LU (A);
    split_mat (A, L, U);
    clean_mat (A);
    sparse_matmult (L, U, A);
    compare_mat (origA, A);
}

#pragma css task input(Src) output(Dst)
void copy_block (float Src[BS][BS], float Dst[BS][BS]);

void copy_mat (float *Src, float *Dst)
{
    ...
    for (ii=0; ii<NB; ii++)
        for (jj=0; jj<NB; jj++)
            ...
            copy_block(Src[ii][jj], block);
            ...
}

#pragma css task input(A) output(L,U)
void split_block (float A[BS][BS], float L[BS][BS], float U[BS][BS]);

void split_mat (float *LU[NB][NB], float *L[NB][NB], float *U[NB][NB])
{
    ...
    for (ii=0; ii<NB; ii++)
        for (jj=0; jj<NB; jj++) {
            ...
            split_block (LU[ii][ii], L[ii][ii], U[ii][ii]);
            ...
        }
}

61 Programming with CellSs CellSs: Programming examples int main(int argc, char* argv[])‏ {... copy_mat (A, origA); LU (A); split_mat (A, L, U); clean_mat (A); sparse_matmult (L, U, A); compare_mat (origA, A); } Checking LU void clean_mat (p_block_t Src[NB][NB])‏ { int ii, jj; for (ii=0; ii<NB; ii++)‏ for (jj=0; jj<NB; jj++)‏ if (Src[ii][jj] != NULL) { free (Src[ii][jj]); Src[ii][jj]=NULL; } #pragma css task output(Dst)‏ void clean_block (float Dst[BS][BS] ); void clean_mat (p_block_t Src[NB][NB])‏ { int ii, jj; for (ii=0; ii<NB; ii++)‏ for (jj=0; jj<NB; jj++)‏ if (Src[ii][jj] != NULL) { clean_block(Src[ii][jj]); }

62 Programming with CellSs CellSs: Programming examples int main(int argc, char* argv[])‏ {... copy_mat (A, origA); LU (A); split_mat (A, L, U); clean_mat (A); sparse_matmult (L, U, A); compare_mat (origA, A); } Checking LU void sparse_matmult (float *A[NB][NB], float *B[NB][NB], float *C[NB][NB])‏ { int ii, jj, kk; for (ii=0; ii<NB; ii++)‏ for (jj=0; jj<NB; jj++)‏ for (kk=0; kk<NB; kk++)‏ if ((A[ii][kk]!= NULL) && (B[kk][jj] !=NULL )) { if (C[ii][jj] == NULL) C[ii][jj] = allocate_clean_block(); block_matmul (A[ii][kk], B[kk][jj], C[ii][jj]); } #pragma css task input(a,b) inout(c)‏ void block_matmul(float a[BS][BS], float b[BS][BS], float c[BS][BS])‏ { int i, j, k; for (i=0; i<BS; i++)‏ for (j=0; j<BS; j++)‏ for (k=0; k<BS; k++)‏ c[i][j] += a[i][k]*b[k][j]; }

63 Programming with CellSs CellSs: Programming examples Checking LU
int main(int argc, char* argv[])
{
    ...
    copy_mat (A, origA);
    LU (A);
    split_mat (A, L, U);
    clean_mat (A);
    sparse_matmult (L, U, A);
    compare_mat (origA, A);
}

#pragma css task input (ref_block, to_comp) output (mse)
void are_blocks_equal (float ref_block[BS][BS], float to_comp[BS][BS], float *mse);

void compare_mat (p_block_t X[NB][NB], p_block_t Y[NB][NB], struct timeval *stop)
{
    ...
    Zero_block = allocate_clean_block();
    for (ii = 0; ii < NB; ii++)
        for (jj = 0; jj < NB; jj++) {
            if (X[ii][jj] == NULL)
                if (Y[ii][jj] == NULL) sq_error[ii][jj] = 0.0f;
                else are_blocks_equal(Zero_block, Y[ii][jj], &sq_error[ii][jj]);
            else are_blocks_equal(X[ii][jj], Y[ii][jj], &sq_error[ii][jj]);
        }
    #pragma css finish
    for (ii = 0; ii < NB; ii++)
        for (jj = 0; jj < NB; jj++)
            if (sq_error[ii][jj] > 0.0000001L) {
                printf ("block [%d, %d]: detected mse = % 20lf\n", ii, jj, sq_error[ii][jj]);
                some_difference = TRUE;
            }
    if (some_difference == FALSE) printf ("matrices are identical\n");
}

64 Programming with CellSs CellSs: Programming examples Checking LU
int main(int argc, char* argv[])
{
    ...
    copy_mat (A, origA);
    LU (A);
    split_mat (A, L, U);
    clean_mat (A);
    sparse_matmult (L, U, A);
    compare_mat (origA, A);
}

#pragma css task input (ref_block, to_comp) output (mse)
void are_blocks_equal (float ref_block[BS][BS], float to_comp[BS][BS], float *mse);

void compare_mat (p_block_t X[NB][NB], p_block_t Y[NB][NB], struct timeval *stop)
{
    ...
    Zero_block = allocate_clean_block();
    for (ii = 0; ii < NB; ii++)
        for (jj = 0; jj < NB; jj++) {
            if (X[ii][jj] == NULL)
                if (Y[ii][jj] == NULL) sq_error[ii][jj] = 0.0f;
                else are_blocks_equal(Zero_block, Y[ii][jj], &sq_error[ii][jj]);
            else are_blocks_equal(X[ii][jj], Y[ii][jj], &sq_error[ii][jj]);
        }
    for (ii = 0; ii < NB; ii++)
        for (jj = 0; jj < NB; jj++) {
            #pragma css wait on (&sq_error[ii][jj])
            if (sq_error[ii][jj] > 0.0000001L) {
                printf ("block [%d, %d]: detected mse = % 20lf\n", ii, jj, sq_error[ii][jj]);
                some_difference = TRUE;
            }
        }
    if (some_difference == FALSE) printf ("matrices are identical\n");
}

65 Programming with CellSs CellSs: Programming examples Checking LU: behavior copy_mat (A, origA); LU (A); split_mat (A, L, U); clean_mat(A); sparse_matmult (L, U, A); compare_mat (origA, A); [Figure: behavior without CellSs and with CellSs (for NB=4 matrix).]

66 Programming with CellSs CellSs: Programming examples Behavior Checking LU 0: are_blocks_equal 1: bdiv_adapte 2: block_mpy_add 3: bmod 4: clean_block 5: copy_block 6: fwd 7: lu0 8: split_block

67 Programming with CellSs CellSs: Programming examples Molecular dynamics: Argon simulation Simulates the mobility of argon atoms in the gas state, in a constant volume at T=300K. All electrostatic forces that each atom experiences due to the others are considered (F_i). Newton's second law is then applied to each atom: F_i = m*a_i. The initial velocities are random but reasonable for argon atoms at 300K. To maintain a constant temperature throughout the process, the Berendsen algorithm is applied.

68 Programming with CellSs CellSs: Programming examples program argon... !$CSS START do step=1,niter do ii=1, N, BSIZE do jj=1, N, BSIZE call velocity(BSIZE, ii, jj, x(ii), y(ii), z(ii), x(jj), y(jj), z(jj), vx(ii), vy(ii), vz(ii))‏ enddo do jj=1, N, BSIZE call v_mod(BSIZE, v(jj), vx(jj), vy(jj), vz(jj))‏ enddo !$CSS BARRIER tins=0.e0 do i=1,N tins=mkg*v(i)**2/3.e0/kb+tins enddo tins=tins/N lam1=sqrt(t/tins)‏ do ii=1, N, BSIZE call update_position(BSIZE, lam1, vx(ii), vy(ii), vz(ii), x(ii), y(ii), z(ii))‏ enddo !$CSS FINISH end program argon... interface !$CSS TASK subroutine velocity(BSIZE, ii, jj, xi, yi, zi, xj, yj, zj, vx, vy, vz)‏ implicit none integer, intent(in) :: BSIZE, ii, jj real, intent(in), dimension(BSIZE) :: xi, yi, zi, xj, yj, zj real, intent(inout), dimension(BSIZE) :: vx, vy, vz end subroutine !$CSS TASK subroutine update_position(BSIZE, lam1, vx, vy, vz, x, y, z)‏ implicit none integer, intent(in) :: BSIZE real, intent(in) :: lam1 real, intent(inout), dimension(BSIZE) :: vx, vy, vz real, intent(inout), dimension(BSIZE) :: x, y, z end subroutine !$CSS TASK subroutine v_mod(BSIZE, v, vx, vy, vz)‏ implicit none integer, intent(in) :: BSIZE real, intent(out) :: v(BSIZE)‏ real, intent(in), dimension(BSIZE) :: vx, vy, vz end subroutine end interface Molecular dynamics: Argon simulation

69 Programming with CellSs CellSs: Programming examples program argon... !$CSS START do step=1,niter do ii=1, N, BSIZE do jj=1, N, BSIZE call velocity(BSIZE, ii, jj, x(ii), y(ii), z(ii), x(jj), y(jj), z(jj), vx(ii), vy(ii), vz(ii))‏ enddo do jj=1, N, BSIZE call v_mod(BSIZE, v(jj), vx(jj), vy(jj), vz(jj))‏ enddo !$CSS BARRIER tins=0.e0 do i=1,N tins=mkg*v(i)**2/3.e0/kb+tins enddo tins=tins/N lam1=sqrt(t/tins)‏ do ii=1, N, BSIZE call update_position(BSIZE, lam1, vx(ii), vy(ii), vz(ii), x(ii), y(ii), z(ii))‏ enddo !$CSS FINISH end !$CSS TASK subroutine velocity(BSIZE, ii, jj, xi, yi, zi, xj, yj, zj, vx, vy, vz)‏ ! subroutine code end subroutine !$CSS TASK subroutine update_position(BSIZE, lam1, vx, vy, vz, x, y, z)‏ ! subroutine code end subroutine !$CSS TASK subroutine v_mod(BSIZE, v, vx, vy, vz)‏ ! subroutine code end subroutine Molecular dynamics: Argon simulation

70 Programming with CellSs CellSs: Programming examples Vector reduction... Array A BS... NB

71 Programming with CellSs CellSs: Programming examples Vector Reduction int main(int argc, char* argv[])‏ { LEVELS = log2 ((double)NB/BS); #pragma css start for (level = 0 ;level < LEVELS; level++){ range = exp2 ((double)level); for(i=0;i<NB;i+=2*BS*range)‏ block_reduce(&A[i],&A[i+BS*range]); } block_reduce2(&A[0], &reduction); #pragma css finish } #pragma css task input(B[64*64]) inout(A[64*64])‏ void block_reduce(float *A, float *B)‏ { int i; for (i=0; i<BS; i++)‏ A[i] += B[i]; } #pragma css task input(A) output(x)‏ void block_reduce2(float *A, float *x)‏ { int i; *x = 0.0; for (i=0; i<BS; i++)‏ *x += A[i]; }
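The pairing pattern of this tree reduction can be checked with a sequential sketch. The code below is plain C rather than CellSs: the task annotation appears only as a comment so that any C compiler accepts it, the names tree_reduce and block_sum are hypothetical, and the stride loop replaces the log2/exp2 arithmetic of the slide. It assumes only that the array length is a whole number of blocks.

```c
#include <assert.h>
#include <stddef.h>

/* In CellSs this would be a task, e.g.: #pragma css task input(b) inout(a) */
static void block_reduce(float *a, const float *b, size_t bs)
{
    for (size_t i = 0; i < bs; i++)
        a[i] += b[i];
}

/* Final step: collapse one block into a scalar. */
static float block_sum(const float *a, size_t bs)
{
    float x = 0.0f;
    for (size_t i = 0; i < bs; i++)
        x += a[i];
    return x;
}

/* Pairwise (tree) reduction over n floats split into blocks of bs floats.
 * At each level, the block at offset i absorbs the block at i + stride.
 * The input array is overwritten. */
float tree_reduce(float *a, size_t n, size_t bs)
{
    assert(bs > 0 && n >= bs && n % bs == 0);

    for (size_t stride = bs; stride < n; stride *= 2)
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            block_reduce(&a[i], &a[i + stride], bs);

    return block_sum(a, bs);
}
```

Within one level the calls touch disjoint blocks, which is why a task-based runtime could run them concurrently; the levels themselves depend on one another.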

72 Programming with CellSs CellSs: Programming examples Vector reduction... Array A BS... NB neutral element - Less concurrency for one vector - Fine when considering several

73 Programming with CellSs CellSs: Programming examples Vector Reduction int main(int argc, char* argv[])‏ { LEVELS = log2 ((double)NB/BS); #pragma css start for (i=0; i<NB; i+= BS)‏ block_reduce(&RB[0], &A[i]); block_reduce2(&RB[0], &reduction); #pragma css finish } #pragma css task input(B[64*64]) inout(A[64*64])‏ void block_reduce(float *A, float *B)‏ { int i; for (i=0; i<BS; i++)‏ A[i] += B[i]; } #pragma css task input(A) output(x)‏ void block_reduce2(float *A, float *x)‏ { int i; *x = 0.0; for (i=0; i<BS; i++)‏ *x += A[i]; }
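A sequential sketch of this second variant, under stated assumptions: the accumulator block (RB on the slide, rb here) starts at the neutral element of addition, and every block of the array is added into it. The function name accumulate_reduce and the MAX_BS bound are hypothetical. Because every step updates the same accumulator, the corresponding tasks would form a chain, which matches the "less concurrency for one vector" remark on the previous slide.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BS 64  /* assumed upper bound on the block size */

/* Sums n floats through an accumulator block that starts at the neutral
 * element of addition. Every step updates rb, so in a task-based runtime
 * the steps would depend on one another and run in order.
 * The input array is left untouched. */
float accumulate_reduce(const float *a, size_t n, size_t bs)
{
    float rb[MAX_BS] = {0.0f};   /* neutral element: all zeros */
    float x = 0.0f;

    assert(bs > 0 && bs <= MAX_BS && n % bs == 0);

    for (size_t i = 0; i < n; i += bs)       /* rb += block starting at i */
        for (size_t k = 0; k < bs; k++)
            rb[k] += a[i + k];

    for (size_t k = 0; k < bs; k++)          /* collapse rb into a scalar */
        x += rb[k];
    return x;
}
```

With several independent vectors, each one would get its own accumulator, so the chains could proceed side by side.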

74 Programming with CellSs CellSs: Compiling and running a CellSs application Usage: cellss-cc cellss-cc -help : lists usage Options: Regular compilation flags: -O, -g, -o, -D... Specific compilation flags: -t : tracing enabled, generates Paraver tracefiles -WPPUp, : passes a comma-separated list of flags to the PPU preprocessor -WPPUc, : passes a comma-separated list of flags to the PPU compiler -WPPUl, : passes a comma-separated list of flags to the PPU linker -WPPUf, : passes a comma-separated list of flags to the PPU Fortran compiler -WSPUp, : passes a comma-separated list of flags to the SPU preprocessor -WSPUc, : passes a comma-separated list of flags to the SPU compiler -WSPUf, : passes a comma-separated list of flags to the SPU Fortran compiler

75 Programming with CellSs CellSs: Compiling and running a CellSs application Examples > cellss-cc -O3 *.c -o my_binary > cellss-cc -O3 matmul.f90 -o matmul > cellss-cc -O2 -WSPUc,-funroll-loops,-ftree-vectorize -WSPUc,-ftree-vectorizer-verbose=3 matmul.c -o matmul > cellss-cc -O3 -k test.c -o test > cellss-cc -O5 -o argon2 argon2_css.f90 -t

76 Programming with CellSs CellSs: Compiling and running a CellSs application Multiple source files > cellss-cc -O3 -c code1.c > cellss-cc -O3 -c code2.c > cellss-cc -O3 -c code3.f90 > cellss-cc -O3 code1.o code2.o code3.o -o my_binary Use in a Makefile CC = cellss-cc LD = cellss-cc CFLAGS = -O2 -g SOURCES = code1.c code2.c code3.c BINARY = my_binary $(BINARY): $(SOURCES)‏

77 Programming with CellSs CellSs: Compiling and running a CellSs application Running Setting the LD_LIBRARY_PATH (not always needed): export LD_LIBRARY_PATH=${HOME_CELLSS}/lib:$LD_LIBRARY_PATH Setting the number of SPUs (default 8, valid from 1 to 16 in a blade, from 1 to 6 in a PS3) export CSS_NUM_SPUS=6 Normal execution from the command line: ./my_binary arg1 arg2... argn
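As an illustration of the range rule stated on this slide (default 8, 1 to 16 on a blade, 1 to 6 on a PS3), the sketch below shows how a program could read and validate CSS_NUM_SPUS itself. It is not the CellSs runtime's code: the helper names are hypothetical, and falling back to the default (capped at the platform maximum) on invalid input is an assumption, since the slide does not say how the runtime reacts.

```c
#include <stdlib.h>

#define DEFAULT_SPUS 8   /* default stated on the slide */

/* Assumed fallback: the default, capped at the platform maximum. */
static int fallback_spus(int max_spus)
{
    return DEFAULT_SPUS < max_spus ? DEFAULT_SPUS : max_spus;
}

/* Returns the SPU count encoded in text, or the fallback when the text is
 * missing, not a plain decimal number, or outside 1..max_spus. */
int parse_num_spus(const char *text, int max_spus)
{
    char *end;
    long value;

    if (text == NULL || *text == '\0')
        return fallback_spus(max_spus);

    value = strtol(text, &end, 10);
    if (*end != '\0' || value < 1 || value > max_spus)
        return fallback_spus(max_spus);

    return (int)value;
}

/* Reads the CSS_NUM_SPUS environment variable named on the slide. */
int num_spus_from_env(int max_spus)
{
    return parse_num_spus(getenv("CSS_NUM_SPUS"), max_spus);
}
```

A caller on a blade would use num_spus_from_env(16), and one on a PS3 num_spus_from_env(6).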

78 Programming with CellSs CellSs: Compiling and running a CellSs application Generating a tracefile Compile with the -t flag > cellss-cc my_app.c -t -O3 -o my_binary_instr Run normally > ./my_binary_instr arg1 arg2... The tracefile is generated automatically. Default name: gss-trace-xxx.ext (gss-trace-0001.prv, gss-trace-0001.row, gss-trace-0001.pcf). All three files are used by the Paraver performance analyser and visualizer. Changing the tracefile name: > export CSS_TRACE_FILENAME=tracefilename This will generate tracefiles tracefilename-0001.prv,...
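The naming scheme above (prefix, four-digit sequence number, extension) can be reproduced with a small sketch, for instance in a script that collects traces. The helpers trace_prefix and trace_filename are hypothetical, and the four-digit zero padding is inferred from the 0001 in the example names rather than from any documented format.

```c
#include <stdio.h>
#include <stdlib.h>

/* Prefix taken from CSS_TRACE_FILENAME, or the default shown on the slide. */
const char *trace_prefix(void)
{
    const char *p = getenv("CSS_TRACE_FILENAME");
    return (p != NULL && *p != '\0') ? p : "gss-trace";
}

/* Writes "<prefix>-<4-digit seq>.<ext>" into buf.
 * Returns 0 on success, -1 if buf is too small or formatting fails. */
int trace_filename(char *buf, size_t size, const char *prefix,
                   int seq, const char *ext)
{
    int n = snprintf(buf, size, "%s-%04d.%s", prefix, seq, ext);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

Calling trace_filename three times with the extensions prv, row and pcf yields the three file names that Paraver expects to find together.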

79 Programming with CellSs CellSs: Compiling and running a CellSs application CellSs configuration file Optional, default settings applied if not provided Plain text file scheduler.min_tasks = 32 scheduler.initial_tasks = 128 scheduler.max_strand_size = 8 task_graph.task_count_high_mark = 2000 task_graph.task_count_low_mark = 1500 renaming.memory_high_mark = 134217728 renaming.memory_low_mark = 104857600
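The settings above all have the shape "key = value" with an integer value. A minimal sketch of reading one such line follows. The function config_match is hypothetical, and the real CellSs parser may accept syntax that this sketch does not (comments, for example), so treat it as an illustration of the file layout only.

```c
#include <stdio.h>
#include <string.h>

/* If line has the form "key = value" and its key equals the requested one,
 * stores the integer value in *out and returns 1; otherwise returns 0. */
int config_match(const char *line, const char *key, long *out)
{
    char name[64];
    long value;

    /* Key: up to 63 characters that are not a space, tab or '='. */
    if (sscanf(line, " %63[^ =\t] = %ld", name, &value) != 2)
        return 0;
    if (strcmp(name, key) != 0)
        return 0;

    *out = value;
    return 1;
}
```

For the line "scheduler.min_tasks = 32" and the key "scheduler.min_tasks", the function stores 32 and returns 1.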

80 Programming with CellSs CellSs: Compiling and running a CellSs application CellSs configuration file scheduler.initial_tasks (128): defines the number of ready-for-execution tasks that are generated at the beginning of the execution of an application, before their scheduling and execution in the SPEs starts scheduler.min_tasks (16): defines the minimum number of ready tasks needed to call the scheduler scheduler.max_strand_size (8): defines the maximum number of tasks that are simultaneously scheduled to an SPE task_graph.task_count_high_mark (1000): defines the maximum number of non-executed tasks that the graph will hold task_graph.task_count_low_mark (900): whenever the task graph reaches the number of tasks defined in the previous variable, the task graph generation is suspended until the number of non-executed tasks goes below this amount

81 Programming with CellSs CellSs: Compiling and running a CellSs application CellSs configuration file renaming.memory_high_mark (∞): defines the maximum amount of memory used for renaming in bytes. renaming.memory_low_mark (1): whenever the renaming memory usage reaches the size specified in the previous variable, the task graph generation is suspended until the renaming memory usage goes below the number of bytes specified in this variable. > export CSS_CONFIG_FILE=file.cfg
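Both pairs of marks described above (task count and renaming memory) follow the same high/low hysteresis: generation stops when the high mark is reached and resumes only once usage drops below the low mark. The sketch below models that rule in isolation. The struct and function names are hypothetical and do not come from the CellSs runtime.

```c
#include <assert.h>

/* Hysteresis between a high and a low mark. */
struct throttle {
    long high_mark;   /* suspend generation when usage reaches this */
    long low_mark;    /* resume once usage drops below this */
    int suspended;    /* 1 while generation is suspended */
};

void throttle_init(struct throttle *t, long high_mark, long low_mark)
{
    assert(low_mark <= high_mark);
    t->high_mark = high_mark;
    t->low_mark = low_mark;
    t->suspended = 0;
}

/* Updates the state for the current usage (pending tasks or bytes).
 * Returns 1 if task graph generation may proceed, 0 if it must wait. */
int throttle_may_generate(struct throttle *t, long usage)
{
    if (!t->suspended && usage >= t->high_mark)
        t->suspended = 1;
    else if (t->suspended && usage < t->low_mark)
        t->suspended = 0;
    return !t->suspended;
}
```

The gap between the two marks keeps the main thread from toggling between generating and waiting on every single task completion.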

82 Programming with CellSs CellSs: Performance Analysis with Paraver Paraver Flexible performance visualization and analysis tool that can be used to analyze: MPI, OpenMP, MPI+OpenMP Java Hardware counters profile Operating system activity... and many other things you may think of Generally it uses external trace file generators. Example for MPI: > mpitrace mpirun -n 10 my_mpi-binary For CellSs, the libraries have been instrumented. When installing the distribution, two libraries are generated: normal and instrumented Flag -t links with instrumented version Available for free from the BSC website: www.bsc.es/paraver

83 Programming with CellSs CellSs: Performance Analysis with Paraver Running paraver paraver tracefile-0001.prv

84 Programming with CellSs CellSs: Performance Analysis with Paraver Configuration files

85 Programming with CellSs CellSs: Performance Analysis with Paraver Configuration files

86 Programming with CellSs CellSs: Performance Analysis with Paraver Clustering Group of 8 tasks (23 us)‏ Block size: 64x64 floats DMA in/out Data re-use Main thread Helper thread

87 Programming with CellSs CellSs: Performance Analysis with Paraver Another Cholesky

88 Programming with CellSs CellSs: Performance evolution Performance: matrix multiply Versions with different task implementation Task duration: from 2000 µsecs (simple C scalar code)‏ to 22 µsecs (highly hand-vectorized/optimized code) July 2007 November 2007 April 2007

89 Programming with CellSs CellSs: Performance evolution Performance: Cholesky factorization April 2007 July 2007 November 2007

90 Programming with CellSs CellSs: Performance evolution Task dependence graph for a 320 x 320 floats matrix (blocks of 64 x 64)‏

91 Programming with CellSs CellSs: Performance evolution SXU LS DMA On-chip coherent bus SL1... PPE Memory controller SXU LS DMA

92 Programming with CellSs CellSs: Performance evolution Increase of locality for Matmul

93 Programming with CellSs CellSs: Performance evolution Increase of locality for Cholesky

94 Programming with CellSs CellSs: Performance evolution Increase of locality for Sparse LU

95 Programming with CellSs CellSs: Performance evolution Increase of locality in the software cache

96 Programming with CellSs CellSs: Performance evolution Increase of locality in the software cache

97 Programming with CellSs CellSs: issues and ongoing efforts CellSs programming model Memory association Array regions Subobject accesses Blocks larger than Local Store. Access to global memory by tasks? Inline directives CellSs runtime system Further optimization of overheads (insert task and remove task), scheduling algorithms: overhead, locality overlays Short circuiting (SPE-to-SPE transfers) SMP superscalar (SMPSs)

98 Programming with CellSs Outline CellSs StarSs Programming Model CellSs syntax CellSs compiler CellSs runtime Installing CellSs Programming examples Compiling and running a CellSs application Performance analysis using Paraver SMPSs Conclusions

99 Programming with CellSs SMPSs “Same” source code Higher flexibility (block size,...) Same compiler Different back-end Execution environment Specific implementation Distributed scheduling No need for data copy 2 way POWER 5 SGI Altix

100 Programming with CellSs SMPSs: Programming example (version array regions) Merge-sort Splits in 4 subarrays each time Sorts the arrays later on, calling a recursive sort to avoid sorting big arrays Using array regions #pragma css task input(V[N]{i..j}) output (M[N][N]{i}{0..N-1})

101 Programming with CellSs SMPSs: Programming example (version array regions) #pragma css task input(low[N]{i1..j1}, low[N]{i2..j2},i1, j1, i2, j2) output (dest[N]{i1..j2})‏ void seqmerge (ELM *low, long i1, long j1, long i2, long j2, ELM *dest); #pragma css task inout (low[N]{i..j}) input (i,j)‏ void seqquick (ELM *low, long i, long j); void sort (ELM *low, long i, long j){... if (size < QUICKSIZE) { seqquick (low, i, j); }else{ quarter = size / 4; i1= i; j1 = i+quarter-1; i2 = i+quarter; j2 = i+2*quarter-1; i3 = i+2*quarter; j3 = i+3*quarter-1; i4 = i+3*quarter; j4 = j; sort(low, i1, j1); sort(low, i2, j2); sort(low, i3, j3); sort(low, i4, j4); merge(low, i1, j1, i2, j2, tmp); merge(low, i3, j3, i4, j4, tmp); merge(tmp, i1, j2, i3, j4, low); }

102 Programming with CellSs SMPSs: Programming example (version array regions) void merge (ELM *low, long i1, long j1, long i2, long j2, ELM *dest){... if (size < MERGESIZE) { seqmerge(low, i1, j1, i2, j2, dest); return; } size /= 2;... split(low, i1, j1, i2, j2, &split1, &split2); merge(low, i1, split1-1, i2, split2-1, dest); merge(low, split1, j1, split2, j2, dest); } main (){ #pragma css start sort(&array, 0, size-1); #pragma css barrier }
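The index arithmetic in sort() is what makes the array regions useful: the four sub-ranges do not overlap, so the four recursive sorts carry no dependences between them. The sketch below isolates that arithmetic. The struct range type and split_quarters are hypothetical names, and size = j - i + 1 is an assumption because the slide elides how size is computed.

```c
#include <assert.h>

struct range { long lo, hi; };   /* inclusive bounds, like i..j in the pragmas */

/* Splits the inclusive range [i, j] into four consecutive sub-ranges in the
 * same way as sort(): three of size/4 elements, and a last one that takes
 * whatever remains. */
void split_quarters(long i, long j, struct range out[4])
{
    long size = j - i + 1;       /* assumed definition of size */
    long quarter = size / 4;

    assert(quarter > 0);

    out[0].lo = i;               out[0].hi = i + quarter - 1;
    out[1].lo = i + quarter;     out[1].hi = i + 2 * quarter - 1;
    out[2].lo = i + 2 * quarter; out[2].hi = i + 3 * quarter - 1;
    out[3].lo = i + 3 * quarter; out[3].hi = j;
}
```

The first two merges then read regions i1..j2 and i3..j4, so each one depends only on the two sorts that produced its own half.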

103 Programming with CellSs SMPSs: Programming example Queens Find a solution to the problem of placing N queens on an N x N board such that no two of them attack each other

104 Programming with CellSs SMPSs : Programming example #pragma css task input (j, i,n) inout (a[n]) highpriority void add_queen_task(char *a, int j, int i, int n); #pragma css task input (results) inout (acc) highpriority void acumulate(int results, int *acc); #pragma css task input (n, j, a[n]) output (results)‏ void nqueens_ser_task(int n, int j, char *a, int *results); void nqueens(int n, int j, char *a, char *b, int depth) { for (i = 0; i < n; i++) { a[j] = i; if (ok(j + 1, a)) { add_queen_task(b, j, i, n); if (depth < task_depth) { nqueens(n, j + 1, a, b, depth + 1); } else { nqueens_ser_task(n, j + 1, b, &results); acumulate(results, &total_res); }
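For reference, a sequential sketch of the search that these tasks parallelize. It counts every valid placement, which is what accumulating results into a total suggests. The bodies of ok, count_from and nqueens_count are written for this illustration and are not taken from the slides; only the board encoding (a[row] = column) and the call shape ok(j + 1, a) follow the code above.

```c
#include <assert.h>

#define MAX_N 16   /* assumed upper bound on the board size */

/* Returns 1 if the first n queens in a[] (a[row] = column) do not attack
 * each other: no shared column and no shared diagonal. */
static int ok(int n, const char *a)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            int d = j - i;
            if (a[j] == a[i] || a[j] == a[i] - d || a[j] == a[i] + d)
                return 0;
        }
    return 1;
}

/* Counts the complete placements that extend the first j rows stored in a[]. */
static int count_from(int n, int j, char *a)
{
    int total = 0;

    if (j == n)
        return 1;
    for (int i = 0; i < n; i++) {
        a[j] = (char)i;
        if (ok(j + 1, a))
            total += count_from(n, j + 1, a);
    }
    return total;
}

/* Number of ways to place n non-attacking queens on an n x n board. */
int nqueens_count(int n)
{
    char a[MAX_N];

    assert(n >= 1 && n <= MAX_N);
    return count_from(n, 0, a);
}
```

In the task version, the top levels of this recursion stay in the main thread, and each subtree below task_depth becomes one nqueens_ser_task working on its own copy of the board.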

105 Programming with CellSs SMPSs: Compiler phase [Diagram of the SMPSS-CC compile step. Labels: app.c, code translation (mcc), smpss-cc_app.c, app.tasks (tasks list), C compiler (gcc, icc,...), smpss-cc_app.o, pack, app.o]

106 Programming with CellSs SMPSs: Linker phase [Diagram of the SMPSS-CC link step. Labels: app.o, unpack, smpss-cc_app.o, app.tasks, app.c, smpss-cc-app.c, glue code generator, app-adapters.c, app-adapters.cc, exec-adapters.c, exec-registration.c, C compiler (gcc, icc,...), exec-adapters.o, exec-registration.o, Linker, libSMPSS.so, exec]

107 Programming with CellSs SMPSs: runtime

108 Programming with CellSs SMPSs: results Multi sort N queens Benchmarks used for OpenMP 3.0 development Similar performance in some ranges Overlap potential in SMPSs Programmability issues Reductions, memory allocations, synchronization representatives, nesting,…

109 Programming with CellSs SMPSs: results

110 Programming with CellSs Outline CellSs StarSs Programming Model CellSs syntax CellSs compiler CellSs runtime Installing CellSs Programming examples Compiling and running a CellSs application Performance analysis using Paraver SMPSs Conclusions

111 Programming with CellSs Conclusions The road for new chips with multi and many cores is open New programming models that can deal with the complexity of the hardware are now more needed than ever StarSs Simple Portable Enough performance Ported to different architectures: CellSs, SMPSs

112 Programming with CellSs CellSs and SMPSs websites CellSs www.bsc.es/cellsuperscalar SMPSs www.bsc.es/smpsuperscalar Both available for download (open source, GPL and LGPL)‏

