Slide 1: Message Passing - MPI on Origin Systems

Slide 2: MPI Programming Model

Slide 3: Compiling MPI Programs

  cc  -64 compute.c -lmpi
  f77 -64 -LANG:recursive=on compute.f -lmpi
  f90 -64 -LANG:recursive=on compute.f -lmpi
  CC  -64 compute.c -lmpi++ -lmpi

The -64 flag is NOT required, but it improves functionality and optimization.
With the 7.2.1 compiler level or higher, f77/f90 can use -auto_use mpi_interface for compile-time subroutine interface checking.

Slide 4: Compiling MPI Programs

Use the header files from /usr/include, since the SGI libraries were built with them (do not use a public-domain version):
  - FORTRAN: mpif.h, or USE MPI
  - C: mpi.h
  - C++: mpi++.h
The mpi_init version used must match the language of the main program (if MPI is called from multiple shared-memory threads, mpi_init_thread must be used instead).

Slide 5: Compiling MPI Programs

MPI naming conventions:
  - FORTRAN: MPI_XXXX (not case sensitive)
  - C: MPI_Xxxx (mixed case, case sensitive)
  - C++: Xxxx (part of the MPI:: namespace)
Every entry point MPI_... in the MPI library has a "shadow" entry point PMPI_... to aid the implementation of user-level profiling.
Array Services (arrayd) is required to run MPI.
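As a minimal sketch of how the PMPI shadow entry points are used, a user profiling wrapper can intercept MPI_SEND and forward to PMPI_SEND; the call counter and the /profcnt/ common block here are hypothetical, not part of MPT:

      subroutine MPI_SEND(buf, count, datatype, dest, tag, comm, ierr)
      real buf(*)
      integer count, datatype, dest, tag, comm, ierr
      integer nsends
      common /profcnt/ nsends          ! hypothetical counter shared with the main program
      nsends = nsends + 1              ! record the call for profiling
      call PMPI_SEND(buf, count, datatype, dest, tag, comm, ierr)   ! forward to the real routine
      end

Linking this wrapper ahead of the MPI library makes every mpi_send in the application pass through it without changing the application source.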

Slide 6: Basic MPI Features

Slide 7: Basic MPI Features

Slide 8: Basic MPI Features

Slide 9: MPI Basic Calls

MPI has a large number of calls. The following are the most basic:
  - every MPI program has to start and finish with these calls (the first and the last executable MPI statements): mpi_init, mpi_finalize
  - essential inquiries about the environment: mpi_comm_size, mpi_comm_rank
  - basic communication calls: mpi_send, mpi_recv
  - basic synchronization call: mpi_barrier

      program mpitest
      include 'mpif.h'
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD,np,ierr)
      call mpi_comm_rank(MPI_COMM_WORLD,id,ierr)
      do i=0,np-1
        if(i.eq.id) print *,'np, id',np,id
        call mpi_barrier(MPI_COMM_WORLD,ierr)
      enddo
      call mpi_finalize(ierr)
      stop
      end

Compile with:  f77 -o mpitest -LANG:recursive=on mpitest.f -lmpi
Run with:      mpirun -np N [-stats -prefix "%g"] mpitest

Slide 10: MPI send and receive Calls

  mpi_send(buf,count,datatype,dest,tag,comm,ierr)
  mpi_recv(buf,count,datatype,source,tag,comm,stat,ierr)

  buf          data to be sent/received
  count        number of items to send; size of buf for recv
  datatype     type of the data items (MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION, etc. in Fortran; MPI_INT, MPI_FLOAT, MPI_DOUBLE, etc. in C)
  dest/source  rank of the peer process (recv may use MPI_ANY_SOURCE)
  tag          integer label of the message (recv may use MPI_ANY_TAG)
  comm         communicator handle (MPI_COMM_WORLD)
  stat         status of the message, of MPI_STATUS type; in Fortran: INTEGER stat(MPI_STATUS_SIZE)

The source/dest, tag, and comm arguments form the message envelope.
The actual number of items received can be obtained from the status:
  call mpi_get_count(stat, MPI_REAL, nitems, ierr)
where nitems can be <= count.
Check for errors:
  if(ierr.ne.MPI_SUCCESS) call abort()
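For example, when receiving with wild cards the status can be examined to find out which rank actually sent the message and how many items arrived (a sketch; buf and the size NMAX are illustrative declarations, not from the slides):

      include 'mpif.h'
      parameter (NMAX=1000)
      real buf(NMAX)
      integer stat(MPI_STATUS_SIZE)
      call mpi_recv(buf, NMAX, MPI_REAL, MPI_ANY_SOURCE, MPI_ANY_TAG,
     &              MPI_COMM_WORLD, stat, ierr)
      isrc = stat(MPI_SOURCE)                        ! rank that actually sent the message
      itag = stat(MPI_TAG)                           ! tag it carried
      call mpi_get_count(stat, MPI_REAL, nitems, ierr)   ! number of items that arrived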

Slide 11: Using send and receive Calls

Rules of use:
  - mpi_send/mpi_recv are defined as blocking calls
  - the program should not assume blocking behaviour (small messages are buffered)
  - when these calls return, the buffers can be (re)used
  - the arrival order of messages sent from A and B to C is not determined; two messages from A to B will arrive in the order sent
  - message-passing programming models are therefore non-deterministic

Example:

      if(mod(id,2).eq.0) then
        idst = mod(id+1,np)
        itag = 0
        call mpi_send(A,N,MPI_REAL,idst,itag,MPI_COMM_WORLD,ierr)
        if(ierr.ne.MPI_SUCCESS) print *,'error from',id,np,ierr
      else
        isrc = mod(id-1+np,np)
        itag = MPI_ANY_TAG
        call mpi_recv(B,NSIZE,MPI_REAL,isrc,itag,MPI_COMM_WORLD,stat,ierr)
        if(ierr.ne.MPI_SUCCESS) print *,'error from',id,np,ierr
        call mpi_get_count(stat,MPI_REAL,N,ierr)
      endif

Slide 12: Another Simple Example

Slide 13: MPI send/receive: Buffering

An MPI program should not assume buffering of messages. The following program is erroneous: running on an Origin2000 on 2 CPUs it blocks once the message size reaches i=2100, because the default buffering limit MPI_BUFFER_MAX=16384 bytes corresponds to 2048 real*8 items.

      program long_messages
      include 'mpif.h'
      real*8 h(4000)
      integer stat(MPI_STATUS_SIZE)
      call mpi_init(info)
      call mpi_comm_rank(MPI_COMM_WORLD, mype, info)
      call mpi_comm_size(MPI_COMM_WORLD, npes, info)
      do i = 1000, 4000, 100            ! increasing size of the message
        call mpi_barrier(MPI_COMM_WORLD,info)
        print *,'mype=',mype,' before send',i
        call mpi_send(h,i,MPI_REAL8,mod(mype+1,npes),i,
     &                MPI_COMM_WORLD,info)
        call mpi_barrier(MPI_COMM_WORLD,info)
        call mpi_recv(h,i,MPI_REAL8,mod(mype-1+npes,npes),i,
     &                MPI_COMM_WORLD,stat,info)
      enddo
      call mpi_finalize(info)
      end
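One conventional fix (a sketch, not something prescribed on the slide) is to replace the send-then-receive sequence with mpi_sendrecv_replace, which cannot deadlock no matter how much the library buffers; it reuses the h array and stat from the program above:

      do i = 1000, 4000, 100
        call mpi_sendrecv_replace(h, i, MPI_REAL8,
     &         mod(mype+1,npes), i,
     &         mod(mype-1+npes,npes), i,
     &         MPI_COMM_WORLD, stat, info)
      enddo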

Slide 14: MPI Asynchronous send/receive

Non-blocking send and receive calls are available:
  mpi_isend(buf,count,datatype,dest,tag,comm,req,ierr)
  mpi_irecv(buf,count,datatype,source,tag,comm,req,ierr)

  buf, count, datatype   message content
  dest, tag, comm        message envelope
  req                    integer holding the request id

The asynchronous call returns the request id after registering the buffer. The request id can be used in the probe and wait calls:
  mpi_wait(req,stat,ierr) blocks until the MPI send or receive with request id req completes
  mpi_waitall(count,array-of-req,array-of-stat,ierr) waits for all the given communications to complete (a blocking call)
The (array of) stat can be probed for the items received. The data is delivered by the matching recv call (irecv, or any other receive variant).
NOTE: although this interface announces asynchronous communication, the actual copying of the buffers happens only at the time of the receive and wait calls.

Slide 15: MPI Asynchronous: Example

Buffer management with asynchronous communication: buffers declared in isend/irecv can be (re)used only after the communication has actually completed. Requests should be freed (mpi_test, mpi_wait, mpi_request_free) for all the isend calls in the program, otherwise mpi_finalize might hang.

      include 'mpif.h'
      integer stat(MPI_STATUS_SIZE,10)
      integer req(10)
      real B1(NB1,10)
      ...
      if(mype.eq.0) then                ! master receiving from all slaves
        do ip=1,npes-1
          call mpi_irecv(B1(1,ip),NB1,MPI_REAL,ip,MPI_ANY_TAG,
     &                   MPI_COMM_WORLD,req(ip),info)
        enddo
        nreq = npes-1
      else                              ! slave sending to the master
        itag = 1                        ! any tag; the master receives with MPI_ANY_TAG
        call mpi_isend(B1(1,mype),NB1,MPI_REAL,0,itag,
     &                 MPI_COMM_WORLD,req(1),info)
        nreq = 1
      endif
      ...                               ! some unrelated calculations
      call mpi_waitall(nreq,req,stat,ierr)
      ...                               ! data is available in B1 in the master process
      ...                               ! buffer B1 can be reused in the slave processes
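To overlap the wait with useful work, a request can also be polled with mpi_test instead of blocking in mpi_wait; a sketch, where do_other_work is a hypothetical placeholder routine:

      logical done
      integer tstat(MPI_STATUS_SIZE)
 100  call mpi_test(req(1), done, tstat, ierr)
      if (.not. done) then
        call do_other_work              ! hypothetical: overlap computation with communication
        goto 100
      endif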

Slide 16: Performance of Asynchronous Communication

Slide 17: MPI Functionality

Slide 18: MPI Most Important Functions

Synchronous communication:   mpi_send, mpi_recv, mpi_sendrecv
Asynchronous communication:  mpi_isend, mpi_irecv, mpi_iprobe, mpi_wait/waitall
Collective communication:    mpi_barrier, mpi_bcast, mpi_gather/scatter, mpi_reduce/allreduce, mpi_alltoall
Creating communicators:      mpi_comm_dup, mpi_comm_split, mpi_comm_free, mpi_intercomm_create
Derived data types:          mpi_type_contiguous, mpi_type_vector, mpi_type_indexed, mpi_pack, mpi_type_commit, mpi_type_free
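As a small illustration of the collective calls listed above, a global sum over all ranks can be done with mpi_allreduce (a sketch; local_sum and global_sum are hypothetical variables):

      real*8 local_sum, global_sum
      call mpi_allreduce(local_sum, global_sum, 1, MPI_REAL8,
     &                   MPI_SUM, MPI_COMM_WORLD, ierr)
      ! every rank now holds the same global_sum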

Slide 19: MPI Most Important Functions (continued)

One-sided communication:  mpi_win_create, mpi_put, mpi_get, mpi_win_fence
Miscellaneous:            MPI_Wtime()
  (based on the SGI_CYCLE clock, with 0.8 microsecond resolution)
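A typical use of MPI_Wtime is to bracket a code region to be timed (a sketch; do_work is a hypothetical routine, id is the rank as in slide 9, and mpif.h is assumed to be included so that MPI_WTIME is declared double precision):

      double precision t0, t1
      t0 = MPI_WTIME()
      call do_work                      ! hypothetical region being timed
      t1 = MPI_WTIME()
      if (id .eq. 0) print *, 'elapsed seconds:', t1 - t0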

Slide 20: MPI Run Time System on SGI

On SGI, all MPI programs are launched with the mpirun command:
  mpirun -np N executable-name arguments            (syntax on a single host)
  mpirun Host_A -np N a.out : Host_B -np M b.out    (multi-host execution of different executables)
mpirun establishes a connection with the Array Daemon through the socket interface, passing the program name, path, and environment variables. The Array Daemon forks the MPI executable N times on each host. N+1 threads are started: the additional thread is a "lazy" thread that blocks in the mpi_init() call and terminates when all other threads have called mpi_finalize().
mpirun -cpr (or -miser) works on a single host and avoids the socket interface to the Array Daemon (needed for the Checkpoint/Restart facility).
Note: start MPI programs with N < the number of processors.
(Slide diagram: mpirun contacts the Array Daemon on each host, which forks t.exe N times; inter-host traffic can use HIPPI-optimized communication.)

Slide 21: MPI Run Time on SGI

Slide 22: MPI Run Time on SGI

Slide 23: MPI Run Time on SGI

Slide 24: MPI Implementation on SGI

In C, mpi_init ignores all arguments passed to it.
All MPI processes are required to call mpi_finalize at exit.
I/O streams:
  - stdin is enabled only for the master thread (the process with rank 0)
  - stdout and stderr are enabled for all threads and are line buffered
  - output from different MPI threads can be prepended with the -prefix argument; output is sent to the mpirun process
    example: mpirun -prefix " " prints:
      Hello World
      Hello World
  - see man mpi(5) and man mpirun(1) for a complete description
Systems with the HIPPI software installed automatically use the HIPPI-optimized communication (HIPPI bypass). If the HIPPI hardware is not actually installed, the bypass must be switched off (setenv MPI_BYPASS_OFF TRUE).
With f77/f90, the -auto_use mpi_interface flag is available to check the consistency of MPI arguments at compile time.
With -64 compilation, the MPI run time maps the address space so that shared-memory optimizations can circumvent the double-copy problem. In particular, communication involving static data (i.e. common blocks) can be sped up.
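As an illustration, the rank of the emitting process can be included in the prefix with the %g directive that appears in the mpirun examples above; the exact output format shown here is assumed, not taken from the slides:

  mpirun -np 2 -prefix "rank %g: " ./a.out
  rank 0: Hello World
  rank 1: Hello World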

Slide 25: SGI Message-Passing Software

SGI Message Passing Toolkit (MPT 1.5):
  - MPI, SHMEM, and PVM components
  - packaged with the Array Services software
MPT external web page:
  http://www.sgi.com/software/mpt/
MPT engineering internal web page:
  http://wwwmn.americas.sgi.com/mpi/

Slide 26: SGI MPI Engineering Expertise

  - large IRIX cluster system support at ASCI
  - large NUMAlink-based system parallel programming
  - Cray MPP parallel programming background on the T3D and T3E
  - collaboration with Myricom for clustering with the Myrinet interconnect

Slide 27: SGI Message-Passing Toolkit

  - fully MPI 1.2 standard compliant (based on MPICH)
  - SHMEM API for one-sided communication
  - support for selected MPI-2 features, with further enhancements as customer needs dictate:
      MPI I/O (ROMIO version 1.0.2)
      MPI one-sided communication
      thread safety
      Fortran 90 bindings: USE MPI
      C++ bindings
  - PVM available on IRIX (public-domain version)
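As a small illustration of the Fortran 90 bindings, the slide 9 program can be written with USE MPI instead of the mpif.h include (a sketch):

      program hello
      use mpi
      implicit none
      integer :: ierr, id, np
      call mpi_init(ierr)
      call mpi_comm_size(MPI_COMM_WORLD, np, ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, id, ierr)
      print *, 'np, id', np, id
      call mpi_finalize(ierr)
      end program hello

Because the module carries explicit interfaces, argument mismatches are caught at compile time, similar to the -auto_use mpi_interface flag mentioned on slide 3.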

Slide 28: MPT: Supported Platforms

Now:
  - IRIX SSI
  - IRIX clusters (GSN, HIPPI, Ethernet)
  - IA32 and IA64 SSI with Linux
  - IA32 clusters (Myrinet, Ethernet) with Linux
Soon:
  - partitioned IRIX (NUMAlink interconnect)
  - IRIX clusters (Myrinet)
  - partitioned SN IA (NUMAlink interconnect)
  - IA64 clusters (Myrinet, Ethernet)

Slide 29: Convenience Features in MPT

  - MPI job management with LSF, NQE, PBS, and others
  - Totalview debugger interoperability
  - Fortran MPI subroutine interface checking at compile time with USE MPI
  - aborted cluster jobs are cleaned up automatically
  - Array Services provides job control for cluster jobs
  - Array Services and MPI work together to propagate user signals to all slaves
  - use shell modules to install multiple versions of MPT on the same system

Slide 30: MPI Performance

  - low latency and high bandwidth
  - fetchop-assisted fast message queuing
  - fast fetchop tree barriers
  - very fast MPI and SHMEM one-sided communication
  - interoperability with SHMEM
  - support for SSI up to 512 processors
  - automatic NUMA placement
  - optimized MPI collectives
  - internal MPI statistics reporting
  - integration with PCP
  - direct send/recv transfers
  - no-impact thread safety support
  - runtime MPI tuning

Slide 31: NUMAlink Implementation

  - used by MPI_Barrier, MPI_Win_fence, and shmem_barrier_all
  - fetch-op variables on the Hub provide fast synchronization for the flat and tree barrier methods
  - the fetch-op AMO helped reduce MPI send/recv latency from 12 to 8 usec
(Slide diagram: CPUs attached to a Hub holding the fetch-op variable, connected to the router.)

Slide 32: NUMAlink-based MPI Performance

(Chart: MPI performance on Origin 2000 and Origin 3000.)

Slide 33: I/O Interconnect Performance

  - low latency and high bandwidth on GSN
  - support for clusters of up to 48 hosts x 128 CPUs
  - support for fast packet error checking and retransmit
  - multi-NIC support
  - Array Services expedites job launch
  - NUMA placement coordinates with cluster placement
  - MPI collectives optimized for clusters of SMPs
  - internal MPI statistics reporting
  - hierarchical job startup
  - environment variables tune MPI for maximum performance

Slide 34: ASCI Blue Mountain Cluster

  - 16-way HIPPI-800 switches and the SPPM benchmark led to this topology
  - there are 36 separate networks with 576 HIPPI adapters
  - inside each 8-host cluster the connectivity is 12; going outside a cluster, connectivity drops to 4 or 2
(Slide diagram legend: each symbol is an 8 x 128-CPU Origin 2000 cluster with 32 connections and 2 HIPPI switches.)

Slide 35: GSN Performance

(Chart: MPI performance on Origin 2000, 300 MHz.)

Slide 36: SHMEM Model

Slide 37: SHMEM API

Slide 38: SHMEM API

Slide 39: SHMEM API

Slide 40: One-Sided Communication Pattern

(Slide diagram: processes 0 to N-1 alternate compute and communicate phases over time, separated by barriers.)

Slide 41: Learning about One-Sided (put/get) Communication

  - three MPI put/get paradigms are defined; SGI has implemented the simplest paradigm: communicate, sync, compute, sync
  - read the MPI_Win man page and the MPI-2 standard at http://www.mpi-forum.org/docs/mpi-20.ps
  - use the SHMEM API for the simplest approach to one-sided communication

Slide 42: MPI Message Exchange (on host)

(Slide diagram: the default two-copy exchange. Process 0's MPI_Send(src,len,...) copies the data from src into a data buffer in the shared-memory message queues and posts a fetchop-protected message header; process 1's MPI_Recv(dst,len,...) then copies the data from that buffer into dst.)

Slide 43: MPI Message Exchange Using Single Copy (on host)

(Slide diagram: with single copy there are no intermediate data buffers; only the fetchop-protected message headers go through the shared-memory queues, and MPI_Recv(dst,len,...) copies the data directly from process 0's src into dst.)

Slide 44: Performance of Synchronous Communication

Slide 45: Performance of Synchronous Communication

Slide 46: Using Single Copy send/recv

Set MPI_BUFFER_MAX to N: any message larger than N bytes will be transferred by direct copy, provided that
  - MPI semantics allow it
  - the -64 ABI is used
  - the memory region it is allocated in is a globally accessible location
N=2000 seems to work well; shorter messages do not benefit from the direct-copy transfer method.
Look at the -stats output to verify that direct copy was used (see the example below).
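A typical way to apply this (csh syntax, as used elsewhere in these slides; the executable name and process count are illustrative):

  setenv MPI_BUFFER_MAX 2000
  mpirun -np 8 -stats ./compute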

Slide 47: Making Memory Globally Accessible for Single Copy send/recv

The user's send buffer must reside in one of the following regions (see the sketch below):
  - static memory (-static, common blocks, DATA, SAVE)
  - the symmetric heap (allocated with SHPALLOC or shmalloc)
  - the global heap (allocated with the f90 ALLOCATE statement with SMA_GLOBAL_ALLOC set; MIPSpro version 7.3.1.1m)
When SMA_GLOBAL_ALLOC is set, it is usually also necessary to increase the global heap size by setting SMA_GLOBAL_HEAP_SIZE.
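A sketch of two of the Fortran options (array names and sizes are illustrative):

      real*8 a(100000)
      common /xferbuf/ a                ! static memory: remotely accessible

      real*8, allocatable :: b(:)
      ...
      allocate(b(100000))               ! lands on the global heap, and is therefore
                                        ! remotely accessible, only if SMA_GLOBAL_ALLOC
                                        ! is set in the environment at run time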

Slide 48: Double Ping Pong Test

The double ping pong test (known as COMMS2 in the Parkbench suite).

MPI version:

      if (my_rank .eq. master) then
        T0 = MPI_WTIME()
        do i = 1, NREPT
          call mpi_sendrecv(A, ilen, MPI_BYTE, nslave, 10,
     &                      B, ilen, MPI_BYTE, nslave, 20,
     &                      MPI_COMM_WORLD, status, ierr)
          call mpi_sendrecv(A, ilen, MPI_BYTE, nslave, 10,
     &                      B, ilen, MPI_BYTE, nslave, 20,
     &                      MPI_COMM_WORLD, status, ierr)
        end do
        T1 = MPI_WTIME()
        Tn = (T1-T0)/(NREPT*2)
      else
        do i = 1, NREPT
          call mpi_sendrecv(A, ilen, MPI_BYTE, master, 20,
     &                      B, ilen, MPI_BYTE, master, 10,
     &                      MPI_COMM_WORLD, status, ierr)
          call mpi_sendrecv(A, ilen, MPI_BYTE, master, 20,
     &                      B, ilen, MPI_BYTE, master, 10,
     &                      MPI_COMM_WORLD, status, ierr)
        end do
      endif

SHMEM version:

      if (my_rank .eq. master) then
        T0 = MPI_WTIME()
        do i = 1, NREPT
          call shmem_putmem(B, A, ilen, nslave)
          call shmem_barrier_all()      ! optional
          call shmem_getmem(A, B, ilen, nslave)
          call shmem_barrier_all()      ! optional
        end do
        T1 = MPI_WTIME()
        Tn = (T1-T0)/(NREPT*2)
      else
        do i = 1, NREPT
          call shmem_putmem(B, A, ilen, master)
          call shmem_barrier_all()      ! optional
          call shmem_getmem(A, B, ilen, master)
          call shmem_barrier_all()      ! optional
        end do
      endif

Since the send and receive arrays (A and B) are independent, no synchronization is strictly necessary in the SHMEM version, but the optional barriers make it mimic the MPI behaviour most closely.

Slide 49: Double Ping Pong Test

Performance of the double ping pong test:
  - the test shows cache effects, since every send/recv operation is performed 50 times
  - the important point is that the bcopy bandwidth of 150 MB/s can now be reached from MPI programs
(Chart: double ping-pong bandwidth for R12K @ 300 MHz.)
Actions:
  - convert to SHMEM, or
  - use single-copy send/recv on remotely accessible variables

Slide 50: Global Communication Test

The ALL-to-ALL communication test (known as COMMS3 in the Parkbench suite).
(Slide diagram: each of processes p0..pn sends a block of iw words from its send array A to every other process's receive array B.)

Slide 51: Global Communication

The ALL-to-ALL communication test:

MPI version:

C     every processor sends a message to every other processor,
C     then every processor receives the messages directed to it
      T0 = MPI_WTIME()
      do i = 1, NREPT
        call mpi_alltoall(A, iw, MPI_DOUBLE_PRECISION,
     &                    B, iw, MPI_DOUBLE_PRECISION,
     &                    MPI_COMM_WORLD, ier)
      end do
      T1 = MPI_WTIME()
      Tn = (T1-T0)/(NREPT*NP*(NP-1))    ! NP processes send NP-1 messages

SHMEM version:

      T0 = MPI_WTIME()
      do i = 1, NREPT
        call shmem_barrier_all()
        do j = 0, NP-1
          other = mod(my_rank+j, NP)
          call shmem_put8(B(1+iw*my_rank), A(1+iw*other), iw, other)
        enddo
      enddo
      T1 = MPI_WTIME()
      Tn = (T1-T0)/(NREPT*NP*(NP-1))    ! NP processes send NP-1 messages

Slide 52: Global Communication

Performance of the global communication test:
  - the test shows cache effects, since every operation is performed 50 times
  - the global communication routines already use a single-copy algorithm for remotely accessible variables in MPT 1.4.0.0
(Chart: All-to-All bandwidth for R12K @ 300 MHz.)
Actions:
  - convert to SHMEM, or
  - use single-copy send/recv on remotely accessible variables

Slide 53: Global Communication

Conclusion: implement critical data exchanges in MPI programs with SHMEM, or with single-copy MPI on static or shmalloc/shpalloc-allocated data.
(Chart: single-copy vs. double-copy bandwidth.)

Slide 54: MPI get/put

For codes that are latency sensitive, try using one-sided MPI (get/put).
Latency over NUMAlink on Origin 3000:
  - send/recv: 5 microseconds
  - mpi_get: 0.7 microseconds
If portability is not an issue, use SHMEM instead:
  - shmem_get latency: 0.5 microseconds (estimate by the MPT group)
  - much easier to write the code

Slide 55: Transposition with SHMEM vs. send/recv

SHMEM version:

      call shmem_barrier_all()
      do 150 kk=1,lmtot
        ktag=ksendto(kk)
        call shmem_put8(y(1+(ktag-1)*len), x(1,ksnding(kk)),
     &                  len, ipsndto(kk))
 150  continue
      call shmem_barrier_all()

send/recv version:

      ltag=0
      do 150 kk=1,lmtot
        ltag=ltag+1
        ktag=ksendto(kk)
        call mpi_isend(x(1,ksnding(kk)), len, mpireal, ipsndto(kk),
     &                 ktag, mpicomm, iss(ltag), istat)
        ltag=ltag+1
        ktag=krcving(kk)
        call mpi_irecv(y(1,krcving(kk)), len, mpireal, iprcvfr(kk),
     &                 ktag, mpicomm, iss(ltag), istat)
 150  continue
      call mpi_waitall(ltag, iss, istatm, istat)

Slide 56: Transposition with MPI_put

      common/buffer/ yg(length)
      integer(kind=MPI_ADDRESS_KIND) winsize, target_disp

! Setup: create a window for array yg, since we will do puts into it
      call MPI_type_extent(MPI_REAL8, isizereal8, ierr)
      winsize = isizereal8*length
      call MPI_win_create(yg, winsize, isizereal8, MPI_INFO_NULL,
     &                    MPI_COMM_WORLD, iwin, ierr)

Slide 57: Transposition with MPI_put

      call mpi_barrier(MPI_COMM_WORLD,ierr)
      do 150 kk=1,lmtot
        ktag=ksendto(kk)
        target_disp=(1+(ktag-1)*len)-1
        call mpi_put(x(1,ksnding(kk)), len, MPI_REAL8, ipsndto(kk),
     &               target_disp, len, MPI_REAL8, iwin, ierr)
 150  continue
      call mpi_win_fence(0, iwin, ierr)
      do kk=1,len*lmtot
        y(kk)=yg(kk)
      end do

! Cleanup: destroy the window
      call mpi_barrier(MPI_COMM_WORLD,ierr)
      call mpi_win_free(iwin, ierr)

Slide 58: Performance of One-Sided Communication

Slide 59: Performance of One-Sided Communication

Slide 60: Performance of the Message Passing Libraries

  - latency is the time it takes to pass a very short (zero-length) message
  - bandwidth is the sustained performance when passing long messages
  - the "single" test uses the send/recv pair; the "multiple" test uses the equivalent of the sendrecv primitive
  - note that single bcopy speed on the Origin 2000 is about 150 MB/s
  - MPI suffers a performance disadvantage with respect to SHMEM because MPI semantics require separate address spaces between threads, so the MPI implementation needs a "double copy" to pass messages
  - SHMEM is optimized for one-sided communication, as in SMP programming, and therefore shows very good latency

Slide 61: MPI Tips for Performance

  - use the 64-bit ABI for the additional memory cross-mapping MPI optimizations
  - use cpusets for the most reproducible results in a batch environment
  - avoid over-subscription of tasks to physical CPUs in a throughput benchmark
  - use the -stats option and the MPI tuning variables

Slide 62: MPI Tips for Performance

  - try direct-copy send/receive for memory bandwidth improvement, also in collective calls
  - use one-sided communication for latency (and memory bandwidth) improvement
  - try setting MPI_DSM_MUSTRUN or SMA_DSM_MUSTRUN to maintain CPU/memory affinity
  - do NOT use bsend/ssend or wild cards (MPI_ANY_SOURCE, MPI_ANY_TAG) in message headers

Slide 63: Important Environment Variables

  MPI_DSM_MUSTRUN
  MPI_REQUEST_MAX
  MPI_GM_ON
  MPI_BAR_DISSEM
  MPI_BUFS_PER_PROC
  MPI_BUFS_PER_HOST
  MPI_BUFFER_MAX
  "-stats" mpirun option / Totalview display
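A typical way to apply these settings (csh syntax, matching the setenv usage elsewhere in the slides; the particular values and executable name are illustrative, not recommendations from the deck):

  setenv MPI_DSM_MUSTRUN
  setenv MPI_BUFS_PER_PROC 64
  mpirun -np 16 -stats ./a.out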

Slide 64: MPI Programs: Debugging

Finding errors in parallel programs is difficult. This is the sequence for attaching the debugger to an MPI program:
  - mpirun -np N ssrun -hang program prog-args
    (this stops execution when the shepherd process is created)
  - obtain the process id of the program with ps -ef | grep program
  - attach the debugger to that process: cvd -pid #procid
  - in cvd, select the menus Admin->Multiprocess View->Config->Preferences... and enable "Attach to forked processes"
  - continue execution; the Multiprocess View will show the list of processes as they are created (Stop All, change focus, etc.)
Alternatively, the MPI program can be instrumented to stop at the desired place (e.g. put "read *,dummy" in the code) and the debugger attached to any thread after creation. The last resort is print statements in the code.

Slide 65: MPI Performance Experiments

Performance data on MPI programs can be collected with:
  mpirun -np N perfex -a -y -mp -o perfex.out program prog-args
The perfex.out.#procid files contain event counts for every MPI thread, and perfex.out contains the aggregate for all the threads together.

Profiling data on MPI programs can be collected with:
  mpirun -np N ssrun -experiment program prog-args
where the experiment is one of the usual experiments (pcsamp, usertime, etc.) or mpi:
  - mpirun -np N ssrun -workshop -mpi prog produces N prog.mpi.f#procid files; these can be aggregated with the ssaggregate tool and viewed interactively with the cvperf tool:
      ssaggregate -e prog.mpi.f* -o prog.mpi_all
      cvperf prog.mpi_all   (or: prof prog.mpi_all)
  - the following routines are traced (see man ssrun(1)): MPI_Barrier(3), MPI_Send(3), MPI_Bsend(3), MPI_Ssend(3), MPI_Rsend(3), MPI_Isend(3), MPI_Ibsend(3), MPI_Issend(3), MPI_Irsend(3), MPI_Sendrecv(3), MPI_Sendrecv_replace(3), MPI_Bcast(3), MPI_Recv(3), MPI_Irecv(3), MPI_Wait(3), MPI_Waitall(3), MPI_Waitany(3), MPI_Waitsome(3), MPI_Test(3), MPI_Testall(3), MPI_Testany(3), MPI_Testsome(3), MPI_Request_free(3), MPI_Cancel(3), MPI_Pcontrol(3)

Slide 66: MPI versus OpenMP

Slide 67: MPI versus OpenMP

Slide 68: MPI versus OpenMP

Slide 69: MPI versus OpenMP

Slide 70: MPI versus OpenMP

Slide 71: SGI Message-Passing References

  - "relnotes mpt" gives information about new features
  - "man mpi" describes all environment variables
  - "man shmem" describes the SHMEM API
  - MPI reference manuals, viewable with the insight viewer:
      "Message Passing Toolkit: MPI Programmer's Manual" (document # 007-3687-005)
  - MPT web page: http://www.sgi.com/software/mpt
  - MPI web sites:
      http://www.mpi-forum.org
      http://www.mcs.anl.gov/mpi/index.html

Slide 72: Summary

  - it is important to understand the semantics of MPI
  - the send/receive calls provide data synchronization, not necessarily process synchronization
  - a correct MPI program cannot depend on buffering of messages
  - for a highly optimized MPI program it is important to use only a few optimized subroutines from the MPI library, typically the straight send/receive variants
  - the SGI implementation of MPI uses N+1 processes for a parallel region, so for scalability it is better to run MPI with fewer processes than there are physical processors in the machine
  - proprietary message-passing libraries (i.e. SHMEM) perform better than MPI on the Origin, because MPI's generic interface makes it much harder to optimize

