CellSs: A Programming Model for the Cell BE Architecture Pieter Bellens, Josep M. Perez, Rosa M. Badia, Jesus Labarta Barcelona Supercomputing Center (BSC-CNS)


CellSs: A Programming Model for the Cell BE Architecture Pieter Bellens, Josep M. Perez, Rosa M. Badia, Jesus Labarta Barcelona Supercomputing Center (BSC-CNS) Technical University of Catalonia (UPC)

Index
- Motivation
- Programming models
- CellSs sample codes
- Compilation environment
- Execution behavior
- Results
- Related work
- Conclusions & ongoing work

Motivation
* Source: Platform 2015: Intel® Processor and Platform Evolution for the Next Decade, Intel White Paper

Motivation
So, what is the Cell BE?
- Architecture point of view: a PPE plus several SPEs
- Programmer's point of view: separate address spaces, tiny local memory, bandwidth constraints, a thin SMT processor, hard to optimize
- User point of view

Programming models
Concept mapping (Cell / Grid):
- Instructions → Block operations → Full binary
- Functional units → SPEs → Remote machines
- Fetch & decode unit → PPE → Local machine
- Registers (name space) → Main memory → Files
- Registers (storage) → SPU memory → Files
Granularity: ns → ~100 microseconds → minutes/hours
Standard sequential languages: on standard processors the code runs sequentially; on the Cell it runs in parallel.
Constraint: block algorithms

CellSs sample code: Matrix multiply

int main(int argc, char **argv)
{
    int i, j, k;
    ...
    initialize(A, B, C);
    for (i = 0; i < NB; i++)
        for (j = 0; j < NB; j++)
            for (k = 0; k < NB; k++)
                block_addmultiply(C[i][j], A[i][k], B[k][j]);
    ...
}

static void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS])
{
    int i, j, k;
    for (i = 0; i < BS; i++)
        for (j = 0; j < BS; j++)
            for (k = 0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}

(NB: number of blocks per dimension; BS: block size.)

CellSs sample code: Matrix multiply

int main(int argc, char **argv)
{
    int i, j, k;
    ...
    initialize(A, B, C);
    for (i = 0; i < NB; i++)         /* loop nest is unrolled into tasks */
        for (j = 0; j < NB; j++)
            for (k = 0; k < NB; k++)
                block_addmultiply(C[i][j], A[i][k], B[k][j]);
    ...
}

#pragma css task input(A, B) inout(C)   /* annotated task: runs on an SPE */
static void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS])
{
    int i, j, k;
    for (i = 0; i < BS; i++)
        for (j = 0; j < BS; j++)
            for (k = 0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}

CellSs sample code: Sparse LU

int main(int argc, char **argv)
{
    int ii, jj, kk;
    ...
    for (kk = 0; kk < NB; kk++) {
        lu0(A[kk][kk]);
        for (jj = kk+1; jj < NB; jj++)
            if (A[kk][jj] != NULL)
                fwd(A[kk][kk], A[kk][jj]);
        for (ii = kk+1; ii < NB; ii++)
            if (A[ii][kk] != NULL) {
                bdiv(A[kk][kk], A[ii][kk]);
                for (jj = kk+1; jj < NB; jj++)
                    if (A[kk][jj] != NULL) {
                        if (A[ii][jj] == NULL)
                            A[ii][jj] = allocate_clean_block();
                        bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
                    }
            }
    }
}

void lu0(float *diag);
void bdiv(float *diag, float *row);
void bmod(float *row, float *col, float *inner);
void fwd(float *diag, float *col);

CellSs sample code: Sparse LU

int main(int argc, char **argv)
{
    int ii, jj, kk;
    ...
    for (kk = 0; kk < NB; kk++) {
        lu0(A[kk][kk]);
        for (jj = kk+1; jj < NB; jj++)
            if (A[kk][jj] != NULL)
                fwd(A[kk][kk], A[kk][jj]);
        for (ii = kk+1; ii < NB; ii++)
            if (A[ii][kk] != NULL) {
                bdiv(A[kk][kk], A[ii][kk]);
                for (jj = kk+1; jj < NB; jj++)
                    if (A[kk][jj] != NULL) {
                        if (A[ii][jj] == NULL)
                            A[ii][jj] = allocate_clean_block();
                        bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
                    }
            }
    }
}

#pragma css task inout(diag[B][B])
void lu0(float *diag);

#pragma css task input(diag[B][B]) inout(row[B][B])
void bdiv(float *diag, float *row);

#pragma css task input(row[B][B], col[B][B]) inout(inner[B][B])
void bmod(float *row, float *col, float *inner);

#pragma css task input(diag[B][B]) inout(col[B][B])
void fwd(float *diag, float *col);

CellSs sample code: Sparse LU (data dependent parallelism)
Tasks are generated only for non-NULL blocks, so the shape of the task graph depends on the input data (same code and annotations as the previous slide).

CellSs sample code: Sparse LU (dynamic main memory allocation)
A previously NULL block is allocated at run time by allocate_clean_block() when bmod first writes into it (same code and annotations as the previous slide).

CellSs sample code: Checking LU

int main(int argc, char *argv[])
{
    ...
    copy_mat(A, origA);
    LU(A);
    split_mat(A, L, U);
    clean_mat(A);
    sparse_matmult(L, U, A);
    compare_mat(origA, A);
}

#pragma css task input(Src) out(Dst)
void copy_block(float Src[BS][BS], float Dst[BS][BS]);

void copy_mat(float *Src, float *Dst)
{
    ...
    for (ii = 0; ii < NB; ii++)
        for (jj = 0; jj < NB; jj++)
            ...
            copy_block(Src[ii][jj], block);
    ...
}

#pragma css task input(A) out(L, U)
void split_block(float A[BS][BS], float L[BS][BS], float U[BS][BS]);

void split_mat(float *LU[NB][NB], float *L[NB][NB], float *U[NB][NB])
{
    ...
    for (ii = 0; ii < NB; ii++)
        for (jj = 0; jj < NB; jj++) {
            ...
            split_block(LU[ii][ii], L[ii][ii], U[ii][ii]);
            ...
        }
}

Compilation environment

- app.c → CSS compiler → app_spe.c + app_ppe.c
- app_spe.c → SPE compiler (SDK) → app_spe.o → SPE linker (with libcss-spe.so) → SPE executable → SPE embedder → PPE object
- app_ppe.c → PPE compiler → app_ppe.o
- app_ppe.o + embedded PPE object → PPE linker (with libcss-ppe.so) → Cell executable

Execution behavior

PPU side (user main program + CellSs PPU lib):
- Main thread: task generation, data dependence analysis, data renaming, scheduling
- Helper thread: work assignment, stage in/out of data, synchronization with the SPUs, finalization signal
- Main memory holds the user data, the renamed copies, and the task graph

SPU side (CellSs SPU lib + original task code, on SPU 0, SPU 1, SPU 2, ...):
- DMA in, task execution, DMA out, synchronization

Execution behavior: Matrix multiply

#pragma css task input(A, B) inout(C)
block_addmultiply(C[i][j], A[i][k], B[k][j])

For each operation, two blocks of data are fetched from PPE memory into the SPE local storage. Clusters of dependent tasks are scheduled to the same SPE, so the inout block is kept in the local storage and only put back in PPE memory once (reuse).

Execution behavior: Matrix multiply
[Trace view] Clustering: chains of 7 block multiplies (270 us); block size 64x64 floats; stage in/out with block reuse; main thread performs task generation alongside the helper thread.

Execution behavior: Matrix multiply
[Trace view] Waiting for SPE availability; schedule and dispatch.

Execution behavior: Matrix multiply
[Trace view] Task generation, graph update, schedule, dispatch, stage out and notification.

Execution behavior: Sparse LU
Priority hints: #pragma css task highpriority ...
- Increase parallelism / support scheduling
- Support reuse

Execution behavior: J_Check_LU

copy_mat(A, origA);
LU(A);
split_mat(A, L, U);
clean_mat(A);
sparse_matmult(L, U, A);
compare_mat(origA, A);

[Trace comparison] Without CellSs vs. with CellSs.

Execution behavior: J_Check

Execution behavior: Other views
- Stage in bandwidth
- Stage out bandwidth
- Task generation lookahead: full unrolling before execution vs. overlapped generation/execution

Scalability
Faster tasks (pre-fetching data)

Related work
- Sequoia: just presented!
- Charm++: runtime tailored to the Cell BE; Offload API
- Octopiler (IBM): auto-SIMDization; OpenMP as programming model; single shared-memory abstraction

Conclusions & Ongoing work
- Cell Superscalar offers a simple programming model for the Cell BE and allows easy porting of applications
- General constraint: blocking
- Ongoing work:
  - Runtime optimization: overheads, halos, scheduling algorithms, overlapping phases, overlays, speculation, short-circuits, more helper threads, lazy renaming, ...
  - Garbage collection
  - Applications: bioengineering
- To be distributed as open source soon

THANKS! Visit us at BSC booth #1800 for further information