1 NORA/Clusters AMANO, Hideharu Textbook pp. 140-147

2 NORA (No Remote Access Memory Model) No hardware shared memory. Data exchange is done by messages (or packets). A dedicated synchronization mechanism is provided. High peak performance. Message passing libraries (MPI, PVM) are provided.

3 Message passing (blocking: rendezvous) [Diagram: matching Send and Receive calls on two processes; the send waits until the matching receive is posted.]

4 Message passing (with buffer) [Diagram: Send and Receive pairs on two processes with an intermediate buffer.]

5 Message passing (non-blocking) [Diagram: Send and Receive on two processes, with other jobs executed while the transfer is in progress.]
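
These styles can be expressed with standard MPI calls. A minimal sketch (not from the textbook) contrasting a blocking send on rank 0 with a non-blocking receive on rank 1; run it with at least two processes:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data = 0;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        data = 42;
        /* Blocking send: returns after the message is buffered or received,
           depending on the implementation (rendezvous in the worst case). */
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Non-blocking receive: post the request, do other work, then wait. */
        MPI_Irecv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        /* ... other job can run here ... */
        MPI_Wait(&req, &status);
        printf("rank 1 received %d\n", data);
    }

    MPI_Finalize();
    return 0;
}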

6 PVM (Parallel Virtual Machine) A buffer is provided for the sender. Both blocking and non-blocking receives are provided. Barrier synchronization.

7 MPI (Message Passing Interface) A superset of PVM for one-to-one communication. Group communication. Various communication patterns are supported. Error checking with communication tags. Details are introduced later.

8 Programming style using MPI
SPMD (Single Program Multiple Data Streams): multiple processes execute the same program, and each process does independent work based on its process number.
Program execution using MPI: the specified number of processes are generated and distributed to the nodes of the NORA machine or PC cluster.

9 Communication methods
Point-to-point communication: a sender and a receiver execute the send and receive functions, and the two calls must match exactly.
Collective communication: communication among multiple processes; the same function is executed by all participating processes. It can be replaced with a sequence of point-to-point communications, but is sometimes more efficient (see the sketch below).
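
As an illustration of the last point, a broadcast followed by a reduction can replace a series of point-to-point transfers. A minimal sketch (not from the textbook):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, n = 0, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) n = 10;
    /* Every process calls the same function; rank 0 is the root. */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Each process contributes its rank; the sum arrives at rank 0. */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("n = %d, sum of ranks = %d\n", n, sum);

    MPI_Finalize();
    return 0;
}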

10 Fundamental MPI functions
Most programs can be described using six fundamental functions:
MPI_Init() … MPI initialization
MPI_Comm_rank() … get the process number (rank)
MPI_Comm_size() … get the total number of processes
MPI_Send() … send a message
MPI_Recv() … receive a message
MPI_Finalize() … MPI termination

11 Other MPI functions
Functions for measurement: MPI_Barrier() … barrier synchronization; MPI_Wtime() … get the wall-clock time (see the timing sketch below).
Non-blocking functions: consist of a communication request and a completion check; other computation can be executed while waiting.
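
A minimal timing sketch (not from the textbook) combining MPI_Barrier() and MPI_Wtime():

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int pid;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);

    MPI_Barrier(MPI_COMM_WORLD);   /* align all processes before timing */
    t0 = MPI_Wtime();
    /* ... computation or communication to be measured ... */
    MPI_Barrier(MPI_COMM_WORLD);   /* wait for the slowest process */
    t1 = MPI_Wtime();

    if (pid == 0) printf("elapsed: %f s\n", t1 - t0);
    MPI_Finalize();
    return 0;
}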

12 An Example

#include <stdio.h>
#include <mpi.h>

#define MSIZE 64

int main(int argc, char **argv)
{
    char msg[MSIZE];
    int pid, nprocs, i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (pid == 0) {
        /* Process 0 collects and prints one message from every other process. */
        for (i = 1; i < nprocs; i++) {
            MPI_Recv(msg, MSIZE, MPI_CHAR, i, 0, MPI_COMM_WORLD, &status);
            fputs(msg, stdout);
        }
    } else {
        /* All other processes send a greeting to process 0. */
        sprintf(msg, "Hello, world! (from process #%d)\n", pid);
        MPI_Send(msg, MSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();

    return 0;
}

13 Initialize and Terminate

int MPI_Init(
    int *argc,      /* pointer to argc */
    char ***argv    /* pointer to argv */
);

mpi_init(ierr)
integer ierr        ! return code

The command-line arguments (argc and argv of main) must be passed directly to MPI_Init.

int MPI_Finalize();

mpi_finalize(ierr)
integer ierr        ! return code

14 Communicator functions

It returns the rank (process ID) within the communicator comm.

int MPI_Comm_rank(
    MPI_Comm comm,  /* communicator */
    int *rank       /* process ID (output) */
);

mpi_comm_rank(comm, rank, ierr)
integer comm, rank
integer ierr        ! return code

It returns the total number of processes in the communicator comm.

int MPI_Comm_size(
    MPI_Comm comm,  /* communicator */
    int *size       /* number of processes (output) */
);

mpi_comm_size(comm, size, ierr)
integer comm, size
integer ierr        ! return code

Communicators are used for sharing a communication space among a subset of processes. MPI_COMM_WORLD is the predefined communicator that contains all processes.
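
As an illustration of communicators other than MPI_COMM_WORLD, the following sketch (not from the textbook) splits the processes into two sub-communicators with MPI_Comm_split():

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, sub_rank;
    MPI_Comm subcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Processes with the same "color" (here: even or odd rank) end up
       in the same sub-communicator; "key" orders the ranks inside it. */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &subcomm);
    MPI_Comm_rank(subcomm, &sub_rank);
    printf("world rank %d -> sub rank %d\n", world_rank, sub_rank);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}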

15 MPI_Send

It sends data to process "dest".

int MPI_Send(
    void *buf,              /* send buffer */
    int count,              /* # of elements to send */
    MPI_Datatype datatype,  /* datatype of elements */
    int dest,               /* destination (receiver) process ID */
    int tag,                /* tag */
    MPI_Comm comm           /* communicator */
);

mpi_send(buf, count, datatype, dest, tag, comm, ierr)
buf(*)
integer count, datatype, dest, tag, comm
integer ierr                /* return code */

Tags are used to identify messages.

16 MPI_Recv

int MPI_Recv(
    void *buf,              /* receive buffer */
    int count,              /* # of elements to receive */
    MPI_Datatype datatype,  /* datatype of elements */
    int source,             /* source (sender) process ID */
    int tag,                /* tag */
    MPI_Comm comm,          /* communicator */
    MPI_Status *status      /* status (output) */
);

mpi_recv(buf, count, datatype, source, tag, comm, status, ierr)
buf(*)
integer count, datatype, source, tag, comm, status(mpi_status_size)
integer ierr                /* return code */

The same tag as the sender's must be passed to MPI_Recv. Pass a pointer to an MPI_Status variable; it is a structure with three members, MPI_SOURCE, MPI_TAG and MPI_ERROR, which hold the sender's process ID, the tag and an error code (see the sketch below).
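
As a variation of the receive loop in the example on slide 12 (a sketch, not the original slide code), MPI_ANY_SOURCE and MPI_ANY_TAG accept messages in any order, and the actual sender and tag are then read from the status:

for (i = 1; i < nprocs; i++) {
    MPI_Recv(msg, MSIZE, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
    printf("received a message from process %d (tag %d)\n",
           status.MPI_SOURCE, status.MPI_TAG);
}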

17 datatype and count
The size of the message is identified by count and datatype:
MPI_CHAR … char
MPI_INT … int
MPI_FLOAT … float
MPI_DOUBLE … double
… etc.

18 Compile and Execution

% icc -o hello hello.c -lmpi
% mpirun -np 8 ./hello
Hello, world! (from process #1)
Hello, world! (from process #2)
Hello, world! (from process #3)
Hello, world! (from process #4)
Hello, world! (from process #5)
Hello, world! (from process #6)
Hello, world! (from process #7)

19 Shared memory model vs. message passing model
Shared memory benefits: a distributed OS is easy to implement; automatic parallelizing compilers; OpenMP.
Message passing benefits: formal verification is easy (blocking); no side effects (a shared variable is a side effect in itself); small cost.

20 OpenMP

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        int tid, npes;
        tid = omp_get_thread_num();
        npes = omp_get_num_threads();
        printf("Hello World from %d of %d\n", tid, npes);
    }
    return 0;
}

Multiple threads are generated by the pragma. Variables declared globally can be shared.
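
With GCC, for example, the program above can be built and run as follows (the -fopenmp flag is GCC-specific, and hello_omp.c is an assumed file name):

% gcc -fopenmp hello_omp.c -o hello_omp
% OMP_NUM_THREADS=4 ./hello_omp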

21 Convenient pragma for parallel execution

#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < 1000; i++) {
        c[i] = a[i] + b[i];
    }
}

The assignment of loop index i to threads is adjusted automatically so that the load on each thread becomes even.
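
When the per-iteration cost is uneven, the schedule clause controls how iterations are assigned to threads. A sketch (heavy_work() is a hypothetical function, not from the slides):

/* Hand out chunks of 16 iterations to idle threads at run time. */
#pragma omp parallel for schedule(dynamic, 16)
for (i = 0; i < 1000; i++) {
    c[i] = heavy_work(a[i], b[i]);
}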

22 Automatic parallelizing compilers
Automatically translate code written for uniprocessors into code for multiprocessors. Loop-level parallelism is the main target of parallelization. Fortran codes have been the main targets: no pointers, and the array structure is simple. Recently, restricted C has also become a target language: Oscar Compiler (Waseda Univ.), COINS.

23 PC Clusters
Multicomputers: dedicated hardware (CPU, network); high performance but expensive; Hitachi's SR8000, Cray T3E, etc.
WS/PC clusters: use standard CPU boards; high performance/cost; the standard bus often forms a bottleneck.
Beowulf clusters: standard CPU boards and components, LAN + TCP/IP, free software. A cluster with a standard System Area Network (SAN) such as Myrinet is often called a Beowulf cluster.

24 PC clusters in supercomputing Clusters occupy more than 80% of the Top500 supercomputers (as of 2008/11). Let's check http://www.top500.org

25 SAN (System Area Network) for PC clusters
Virtual cut-through routing; high throughput/low latency; out of the cabinet but within the floor; also used for connecting disk subsystems, which is why it is sometimes called a System Area Network.
Examples: Infiniband, Myrinet, Quadrics, GbEthernet/10Gb Ethernet (store & forward, tree-based topologies).

26 SAN vs. Ethernet

27 Infiniband
Point-to-point direct serial interconnection, using the 8b/10b code. Various types of topologies can be supported. Multicasting and atomic transactions are supported.

Maximum throughput:
      SDR        DDR        QDR
1X    2 Gbit/s   4 Gbit/s   8 Gbit/s
4X    8 Gbit/s   16 Gbit/s  32 Gbit/s
12X   24 Gbit/s  48 Gbit/s  96 Gbit/s
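
These figures follow from the 8b/10b code: SDR signals at 2.5 Gbit/s per lane, giving a data rate of 2.5 × 8/10 = 2 Gbit/s per lane; DDR and QDR double and quadruple the signaling rate, and 4X/12X aggregate 4 or 12 lanes (for example, 12X QDR = 12 × 8 Gbit/s = 96 Gbit/s).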

28 Remote DMA (user level) [Diagram: the sender buffer on the local node is moved by the network interface's protocol engine to the data sink on the remote node; the conventional path goes through a system call and a kernel agent, whereas RDMA is issued at user level through the host interface.]
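
MPI's one-sided (remote memory access) operations expose a similar user-level model in software. A minimal sketch (an MPI illustration only, not the hardware RDMA interface described here); run it with at least two processes:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, window_buf = 0, value = 99;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each process exposes one int as a window for remote access. */
    MPI_Win_create(&window_buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0)
        /* Write directly into rank 1's window; no matching receive is posted. */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    if (rank == 1) printf("rank 1's window now holds %d\n", window_buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}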

29 PC Clusters with Myrinet (RWCP)

30 RHiNET Cluster
CPU: Pentium III 933MHz
Memory: 1 Gbyte
PCI bus: 64bit/66MHz
OS: Linux kernel 2.4.18
SCore: version 5.0.1
RHiNET-2 with 64 nodes
Network: optical → Gigabit Ethernet

31 Cluster of supercomputers A recent trend in supercomputers is to connect powerful components with high-speed SANs such as Infiniband. Example: Roadrunner by IBM (general-purpose CPUs + Cell BE).

32 Grid Computing Supercomputers connected through the Internet can be treated virtually as a single big supercomputer. Middleware or toolkits, such as the Globus Toolkit, are used to manage it. Example: GEO Grid (Global Earth Observation Grid) by AIST, Japan.

33 Contest/Homework Assume that there is an array x[100]. Write MPI code that computes the sum of all elements of x with 8 processes.
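
One possible sketch (an illustration only, using MPI_Reduce; the exercise may instead expect explicit MPI_Send/MPI_Recv as on slide 12, and the initialization of x is an assumption):

#include <mpi.h>
#include <stdio.h>

#define N 100

int main(int argc, char **argv)
{
    int x[N], i, pid, nprocs, local = 0, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* run with mpirun -np 8 */

    for (i = 0; i < N; i++) x[i] = i + 1;     /* example data, same on every process */

    /* Each process sums its own contiguous block of x. */
    for (i = pid * N / nprocs; i < (pid + 1) * N / nprocs; i++)
        local += x[i];

    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (pid == 0) printf("sum = %d\n", total);

    MPI_Finalize();
    return 0;
}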

