
1 CellSs: A Programming Model for the Cell BE Architecture
Pieter Bellens, Josep M. Perez, Rosa M. Badia, Jesus Labarta
Barcelona Supercomputing Center (BSC-CNS)
Technical University of Catalonia (UPC)
rosa.m.badia@bsc.es

2 Index
- Motivation
- Programming models
- CellSs sample codes
- Compilation environment
- Execution behavior
- Results
- Related work
- Conclusions & ongoing work

3 Motivation
Source: "Platform 2015: Intel® Processor and Platform Evolution for the Next Decade", Intel White Paper.

4 Motivation
So, what is the Cell BE?
- Architecture point of view: a PPE plus several SPEs.
- Programmer's point of view: separate address spaces, tiny local memory, bandwidth, a thin SMT processor, hard to optimize.
- User point of view.

5 Programming models
Concepts mapping (processor → Cell → Grid):
- ns → ~100 microseconds → minutes/hours
- Instructions → Block operations → Full binary
- Functional units → SPEs → Remote machines
- Fetch & decode unit → PPE → Local machine
- Registers (name space) → Main memory → Files
- Registers (storage) → SPU memory → Files
Standard sequential languages: on standard processors the code runs sequentially; on Cell it runs in parallel.
Constraint: block algorithms.

6 CellSs sample code: Matrix multiply

int main(int argc, char **argv)
{
  int i, j, k;
  ...
  initialize(A, B, C);
  for (i = 0; i < NB; i++)
    for (j = 0; j < NB; j++)
      for (k = 0; k < NB; k++)
        block_addmultiply(C[i][j], A[i][k], B[k][j]);
  ...
}

static void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS])
{
  int i, j, k;
  for (i = 0; i < BS; i++)
    for (j = 0; j < BS; j++)
      for (k = 0; k < BS; k++)
        C[i][j] += A[i][k] * B[k][j];
}

7 CellSs sample code: Matrix multiply (annotated)

int main(int argc, char **argv)
{
  /* same main loop as in slide 6 */
  ...
}

#pragma css task input(A, B) inout(C)
static void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS])
{
  int i, j, k;
  for (i = 0; i < BS; i++)
    for (j = 0; j < BS; j++)
      for (k = 0; k < BS; k++)
        C[i][j] += A[i][k] * B[k][j];
}

The annotated function is executed on the SPEs; the loop nest in main is unrolled at run time into individual task invocations.

8 CellSs sample code: Sparse LU

int main(int argc, char **argv)
{
  int ii, jj, kk;
  ...
  for (kk = 0; kk < NB; kk++) {
    lu0(A[kk][kk]);
    for (jj = kk+1; jj < NB; jj++)
      if (A[kk][jj] != NULL)
        fwd(A[kk][kk], A[kk][jj]);
    for (ii = kk+1; ii < NB; ii++)
      if (A[ii][kk] != NULL) {
        bdiv(A[kk][kk], A[ii][kk]);
        for (jj = kk+1; jj < NB; jj++)
          if (A[kk][jj] != NULL) {
            if (A[ii][jj] == NULL)
              A[ii][jj] = allocate_clean_block();
            bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
          }
      }
  }
  ...
}

void lu0(float *diag);
void bdiv(float *diag, float *row);
void bmod(float *row, float *col, float *inner);
void fwd(float *diag, float *col);

9 CellSs sample code: Sparse LU (annotated)

int main(int argc, char **argv)
{
  /* same factorization loop as in slide 8 */
  ...
}

#pragma css task inout(diag[B][B])
void lu0(float *diag);

#pragma css task input(diag[B][B]) inout(row[B][B])
void bdiv(float *diag, float *row);

#pragma css task input(row[B][B], col[B][B]) inout(inner[B][B])
void bmod(float *row, float *col, float *inner);

#pragma css task input(diag[B][B]) inout(col[B][B])
void fwd(float *diag, float *col);

10 CellSs sample code: Sparse LU (data-dependent parallelism)

The code is the same as in slide 9. Because the NULL tests decide which tasks are created, the task graph, and therefore the available parallelism, depends on the sparsity pattern of the input matrix.

11 CellSs sample code: Sparse LU (dynamic main memory allocation, data-dependent parallelism)

The code is again the same as in slide 9. Blocks that start out empty are allocated lazily in main memory by allocate_clean_block() the first time bmod() needs to update them.
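The slides do not show allocate_clean_block(); a minimal sketch of what such a helper might look like (a hypothetical implementation, assuming B x B float blocks reached through pointers, as in the prototypes above) is:

  #include <stdio.h>
  #include <stdlib.h>

  #define B 64   /* block dimension, as in the pragmas above (assumed value) */

  /* Hypothetical implementation of the helper used on the slide: allocate
     one zero-initialised B x B block in main memory, so that bmod() can
     accumulate into a block that was NULL (empty) so far. */
  static float *allocate_clean_block(void)
  {
      float *block = calloc(B * B, sizeof(float));
      if (block == NULL) {
          fprintf(stderr, "out of memory allocating a block\n");
          exit(EXIT_FAILURE);
      }
      return block;
  }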

12 CellSs sample code: Checking LU

int main(int argc, char *argv[])
{
  ...
  copy_mat(A, origA);
  LU(A);
  split_mat(A, L, U);
  clean_mat(A);
  sparse_matmult(L, U, A);
  compare_mat(origA, A);
}

#pragma css task input(Src) out(Dst)
void copy_block(float Src[BS][BS], float Dst[BS][BS]);

void copy_mat(float *Src, float *Dst)
{
  ...
  for (ii = 0; ii < NB; ii++)
    for (jj = 0; jj < NB; jj++)
      ...
      copy_block(Src[ii][jj], block);
  ...
}

#pragma css task input(A) out(L, U)
void split_block(float A[BS][BS], float L[BS][BS], float U[BS][BS]);

void split_mat(float *LU[NB][NB], float *L[NB][NB], float *U[NB][NB])
{
  ...
  for (ii = 0; ii < NB; ii++)
    for (jj = 0; jj < NB; jj++) {
      ...
      split_block(LU[ii][ii], L[ii][ii], U[ii][ii]);
      ...
    }
}

13 Compilation environment

The CSS compiler splits the annotated source app.c into app_spe.c and app_ppe.c. On the SPE side, the SPE compiler (from the SDK) produces app_spe.o, the SPE linker combines it with lib_css-spe.so into an SPE executable, and the SPE embedder wraps that executable into a PPE object. On the PPE side, the PPE compiler produces app_ppe.o, and the PPE linker combines it with lib_css-ppe.so and the embedded SPE object into the final Cell executable.
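The intermediate files themselves are not shown on the slide. Purely as an illustration of the app_ppe.c / app_spe.c split, the generated PPE side could replace each annotated call with a runtime task submission, while the SPE side keeps the original body; the names below (css_add_task, the task id) are hypothetical, not the real CellSs interface.

  #define BS 64   /* block size, as in the matrix multiply example (assumed) */

  /* Hypothetical runtime entry point; the real CellSs PPE library interface
     is internal and not shown in the presentation. */
  extern void css_add_task(int task_id, void *inputs[], int n_inputs,
                           void *inouts[], int n_inouts);

  enum { TASK_BLOCK_ADDMULTIPLY };

  /* app_ppe.c (sketch): the annotated call site no longer computes anything;
     it registers a task with its input/inout operands in the task graph. */
  static void block_addmultiply_adapter(float C[BS][BS],
                                        float A[BS][BS],
                                        float B[BS][BS])
  {
      void *inputs[] = { A, B };   /* input(A, B) */
      void *inouts[] = { C };      /* inout(C)    */
      css_add_task(TASK_BLOCK_ADDMULTIPLY, inputs, 2, inouts, 1);
  }

  /* app_spe.c (sketch) would keep the original block_addmultiply() body,
     indexed by TASK_BLOCK_ADDMULTIPLY, so the CellSs SPU library can run
     it once the operands have been staged into local storage. */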

14 Execution behavior

On the PPU, the user main program runs on the main thread together with a helper thread, both using the CellSs PPU library. The main thread generates tasks and builds the task graph (data dependence analysis, data renaming); the helper thread performs scheduling, work assignment, stage in/out of data, synchronization and the finalization signal. Main memory holds the user data, the renaming copies and the task graph. Each SPU (SPU 0, SPU 1, SPU 2, ...) runs the CellSs SPU library around the original task code: DMA in, task execution, DMA out, synchronization.
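The slide names the mechanisms (task graph, data dependence, data renaming) without detailing them. Below is a much simplified sketch, not the CellSs implementation, of how the main thread could add a dependence edge by comparing a new task's operand addresses against a table of last writers; the table and all names are invented for the illustration.

  #define MAX_TASKS 4096
  #define MAX_SUCC  16

  typedef struct task {
      struct task *succ[MAX_SUCC];  /* edges of the task graph         */
      int          n_succ;
      int          n_pred;          /* predecessors still not finished */
  } task_t;

  /* last task that wrote each block, keyed by the block's main-memory
     address (a real runtime would use a hash table over addresses)    */
  static void   *written[MAX_TASKS];
  static task_t *writer[MAX_TASKS];
  static int     n_written;

  /* Main thread: a new task reading (input/inout) 'block' depends on the
     task that last wrote it, producing a read-after-write edge.        */
  static void add_raw_dependence(task_t *t, void *block)
  {
      for (int i = 0; i < n_written; i++)
          if (written[i] == block) {
              task_t *w = writer[i];
              if (w->n_succ < MAX_SUCC)
                  w->succ[w->n_succ++] = t;
              t->n_pred++;
              return;
          }
  }

Output and inout operands would then be recorded as the new writer; write-after-write and write-after-read conflicts can instead be removed by renaming, that is, giving the task a fresh copy of the block in main memory (the "renaming" area shown on the slide).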

15 Execution behavior: Matrix multiply

#pragma css task input(A, B) inout(C)
block_addmultiply(C[i][j], A[i][k], B[k][j])

- For each operation, the two input blocks are transferred (get) from main memory to the SPE local storage.
- Clusters of dependent tasks are scheduled to the same SPE.
- The inout block C[i][j] is kept in the local storage and put back to main memory only once (reuse).
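A minimal sketch of the reuse decision just described, assuming the SPE keeps a small table of blocks currently resident in its local storage (illustrative only, with a hypothetical dma_get helper; not the actual CellSs SPU library):

  #define BS       64   /* block size used in the traces (64x64 floats)   */
  #define LS_SLOTS 4    /* assumed number of blocks kept in local storage */

  /* hypothetical DMA helper; real code would use the MFC intrinsics */
  extern void dma_get(void *local_store, void *main_memory, unsigned size);

  static void *resident[LS_SLOTS];          /* main-memory address per slot */
  static float ls_block[LS_SLOTS][BS][BS];  /* the cached blocks themselves */

  /* Return a local-storage copy of the block at main-memory address 'ea',
     skipping the stage-in when the block is already resident (reuse). */
  static float (*get_block(void *ea))[BS]
  {
      for (int s = 0; s < LS_SLOTS; s++)
          if (resident[s] == ea)
              return ls_block[s];            /* hit: no DMA needed */

      int victim = 0;                        /* trivial replacement choice */
      dma_get(ls_block[victim], ea, sizeof(ls_block[victim]));
      resident[victim] = ea;
      return ls_block[victim];
  }

A dirty inout block such as C[i][j] would only be written back (put) to main memory when it is evicted or when the chain of tasks using it ends, which is what "put back to main memory only once" means above.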

16 Execution behavior: Matrix multiply (trace)
- Clustering: a chain of 7 block multiplies (270 us); block size: 64x64 floats
- Stage in/out; reuse
- Main thread: task generation; helper thread

17 Execution behavior: Matrix multiply (trace)
- Waiting for SPE availability
- Schedule & dispatch

18 Execution behavior: Matrix multiply (trace)
- Stage out and notification
- Task generation
- Schedule and dispatch
- Graph update

19 Execution behavior: Sparse LU
Priority hints: #pragma css task highpriority …
- Increase parallelism / support scheduling
- Support reuse
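As an illustrative placement of the hint (the slide gives only the pragma syntax), the tasks on the critical path of the factorization, lu0 and fwd, could be marked high priority so that the many bmod tasks they unlock become ready earlier:

  #pragma css task highpriority inout(diag[B][B])
  void lu0(float *diag);

  #pragma css task highpriority input(diag[B][B]) inout(col[B][B])
  void fwd(float *diag, float *col);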

20 Execution behavior: J_Check_LU

copy_mat(A, origA);
LU(A);
split_mat(A, L, U);
clean_mat(A);
sparse_matmult(L, U, A);
compare_mat(origA, A);

Traces: without CellSs vs. with CellSs.

21 Execution behavior: J_Check

22 Execution behavior: Other views
- Stage-in bandwidth
- Stage-out bandwidth
- Task generation lookahead: full unrolling before execution vs. overlapped generation/execution
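The two lookahead strategies differ in when the main thread stops generating tasks. A minimal sketch of the overlapped variant, with an assumed bound on in-flight tasks and a hypothetical wait_for_task_completion() helper (not CellSs parameters or API), is:

  #define MAX_PENDING 1000       /* assumed task-window size */

  extern int  pending_tasks;                   /* +1 on generation, -1 on retirement */
  extern void wait_for_task_completion(void);  /* hypothetical: block until the
                                                  helper thread retires a task */

  /* Main thread, before creating each task.  Full unrolling would skip this
     throttle and build the whole graph before execution starts; overlapped
     generation/execution keeps the graph bounded so the two proceed together. */
  static void throttle_task_generation(void)
  {
      while (pending_tasks >= MAX_PENDING)
          wait_for_task_completion();
  }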

23 Scalability
- Faster tasks (pre-fetching data)
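"Faster tasks (pre-fetching data)" points at hiding the stage-in time behind computation. A minimal double-buffering sketch on the SPE side, with hypothetical dma_get_async/dma_wait helpers standing in for the real MFC intrinsics and a generic one-input compute() task body, is:

  #define BS 64                       /* block size used in the traces */

  /* Hypothetical DMA helpers (real code would use the MFC intrinsics). */
  extern void dma_get_async(void *local_store, void *main_memory,
                            unsigned size, int tag);
  extern void dma_wait(int tag);

  extern void compute(float block[BS][BS]);   /* the task body */

  static float buf[2][BS][BS];                /* double buffer in local store */

  /* Run n tasks whose single input block lives at main_mem[i], prefetching
     the next block while the current one is being computed. */
  void run_tasks(float *main_mem[], int n)
  {
      int cur = 0;
      dma_get_async(buf[cur], main_mem[0], sizeof(buf[cur]), cur);
      for (int i = 0; i < n; i++) {
          int nxt = cur ^ 1;
          if (i + 1 < n)                      /* start next stage-in early */
              dma_get_async(buf[nxt], main_mem[i + 1], sizeof(buf[nxt]), nxt);
          dma_wait(cur);                      /* current block has arrived  */
          compute(buf[cur]);
          cur = nxt;
      }
  }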

24 Related work
- Sequoia (just presented!)
- Charm++: runtime tailored to the Cell BE; Offload API
- Octopiler (IBM): auto-SIMDization, OpenMP as programming model, single shared-memory abstraction

25 Conclusions & ongoing work
- Cell Superscalar offers a simple programming model for the Cell BE
- Allows easy porting of applications
- General constraint: blocking
- Ongoing work:
  - Runtime optimization: overheads, halos, scheduling algorithms, overlapping of phases, overlays, speculation, short-circuits, more helper threads, lazy renaming, ...
  - Garbage collection
  - Applications: bioengineering
- To be distributed as open source soon

26 THANKS! Visit us at BSC booth #1800 for further information

