
1 Supporting the Global Arrays PGAS Model Using MPI One-Sided Communication
James Dinan, Pavan Balaji, Jeff Hammond, Sriram Krishnamoorthy, and Vinod Tipparaju
Presented by: James Dinan, James Wallace Givens Postdoctoral Fellow, Argonne National Laboratory

2 Global Arrays, a Global-View Data Model
Distributed, shared multidimensional arrays
–Aggregate the memory of multiple nodes into a global data space
–Programmer controls data distribution and can exploit locality
One-sided data access: Get/Put({i, j, k}…{i, j, k})
NWChem data management: large coeff. tables (100 GB+)
[Figure: global address space shared across Proc 0..Proc n with private memory per process; array X[M][M][N] and patch X[1..9][1..9][1..9] shown]
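
A minimal sketch of the one-sided access described above, using the standard Global Arrays C interface (NGA_Create, NGA_Get); the array name, dimensions, and index ranges are illustrative, not taken from NWChem.

    #include <ga.h>
    #include <macdecls.h>

    void fetch_patch(void)
    {
        int dims[2]  = {1000, 1000};
        int chunk[2] = {-1, -1};               /* let GA choose the distribution    */
        int g_a = NGA_Create(C_DBL, 2, dims, "coeff", chunk);

        double buf[10][10];
        int lo[2] = {0, 0}, hi[2] = {9, 9};    /* inclusive corners of the patch    */
        int ld[1] = {10};                      /* leading dimension of local buffer */

        NGA_Get(g_a, lo, hi, buf, ld);         /* one-sided get; owner not involved */
        GA_Destroy(g_a);
    }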

3 ARMCI: The Aggregate Remote Memory Copy Interface
GA runtime system
–Manages shared memory
–Provides portability
–Native implementation
One-sided communication
–Get, put, accumulate, …
–Load/store on local data
–Noncontiguous operations
Mutexes, atomics, collectives, processor groups, …
Location-consistent data access
–I see my own operations in issue order
[Figure: GA_Put({x,y},{x,y}) on the global array is translated by the runtime into ARMCI_PutS(rank, addr, …) on the owning process]
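
A hedged sketch of the ARMCI one-sided calls named above (ARMCI_Malloc, ARMCI_Put, ARMCI_Fence); the segment size and neighbor pattern are illustrative.

    #include <armci.h>
    #include <stdlib.h>

    void put_to_neighbor(int me, int nproc)
    {
        void **base = malloc(nproc * sizeof(void *));
        ARMCI_Malloc(base, 1024);                  /* collective: one segment per process */

        int target = (me + 1) % nproc;
        double x = 3.14;
        /* One-sided put into the neighbor's segment; the target does not participate. */
        ARMCI_Put(&x, base[target], sizeof(double), target);
        ARMCI_Fence(target);                       /* wait for remote completion          */
        ARMCI_Barrier();
        free(base);
    }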

4 Implementing ARMCI
ARMCI support
–Natively implemented
–Sparse vendor support
–Implementations lag systems
MPI is ubiquitous
–Has supported one-sided communication for 15 years
Goal: use MPI RMA to implement ARMCI
1. Portable one-sided communication for NWChem users
2. MPI-2: drive implementation performance, one-sided tools
3. MPI-3: motivate features
4. Interoperability: increase the resources available to the application
–ARMCI/MPI share progress, buffer pinning, network and host resources
Challenge: mismatch between MPI RMA and ARMCI
[Figure: software stacks for native ARMCI vs. ARMCI-MPI]

5 MPI Remote Memory Access Interface
Active and passive target modes
–Active: target participates
–Passive: target does not participate
Window: exposes memory for RMA
–Logical public and private copies
–Conservative data consistency model
Accesses must occur within an epoch
–Lock(window, rank) … Unlock(window, rank)
–Access mode can be exclusive or shared
–Operations are not ordered within an epoch
[Figure: Rank 0 opens a Lock…Unlock epoch on Rank 1's window, issues Put(X) and Get(Y); operations complete at Unlock, synchronizing the public and private copies]
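
The epoch discipline above maps onto MPI-2 passive-target calls as in this sketch; the window, target rank, and displacements are placeholders.

    #include <mpi.h>

    void access_epoch(MPI_Win win, int target, double *local, MPI_Aint disp)
    {
        /* All RMA accesses to the target's window must occur inside an epoch. */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
        MPI_Put(&local[0], 1, MPI_DOUBLE, target, disp,     1, MPI_DOUBLE, win);
        MPI_Get(&local[1], 1, MPI_DOUBLE, target, disp + 1, 1, MPI_DOUBLE, win);
        /* Operations are unordered within the epoch and complete at unlock.  */
        MPI_Win_unlock(target, win);
    }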

6 MPI-2 RMA Separate Memory Model
Concurrent, conflicting accesses are erroneous
Conservative, but extremely portable
Compatible with non-coherent memory systems
[Figure: separate public and private window copies; table of which combinations of load/store and RMA from the same or different sources are permitted within an epoch]

7 ARMCI/MPI-RMA Mismatch
1. Shared data segment representation
–MPI: window object, window rank
–ARMCI: base address, absolute rank
–Solution: perform translation
2. Data consistency model
–ARMCI: relaxed; location consistent for RMA; concurrent, conflicting accesses (CCA) undefined
–MPI: explicit (lock and unlock); CCA are an error
–Solution: explicitly maintain consistency, avoid CCA

8 Translation: Global Memory Regions
Translate between ARMCI and MPI shared data segment representations
–ARMCI: array of base pointers
–MPI: window object
Translate between MPI and ARMCI ranks
–MPI: window/communicator rank
–ARMCI: absolute rank
Preserve MPI window semantics
–Manage access epochs
–Protect shared buffers
[Figure: table of per-allocation base pointers indexed by absolute process id, plus allocation metadata: MPI_Win window; int size[nproc]; ARMCI_Group grp; ARMCI_Mutex rmw; …]
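
A hedged sketch of what the GMR metadata and rank translation could look like; the struct fields and helper name are illustrative stand-ins, not the actual ARMCI-MPI data structures.

    #include <mpi.h>

    typedef struct gmr {
        MPI_Win     window;   /* MPI window exposing this allocation         */
        MPI_Aint   *size;     /* per-rank segment size                       */
        void      **base;     /* per-rank base pointer, indexed by abs. rank */
        MPI_Comm    comm;     /* communicator (ARMCI group) for the window   */
        struct gmr *next;     /* allocations kept in a list for lookup       */
    } gmr_t;

    /* Translate an ARMCI absolute rank into the rank within this GMR's group. */
    static int gmr_translate_rank(gmr_t *g, int abs_rank)
    {
        MPI_Group world_grp, win_grp;
        int win_rank;
        MPI_Comm_group(MPI_COMM_WORLD, &world_grp);
        MPI_Comm_group(g->comm, &win_grp);
        MPI_Group_translate_ranks(world_grp, 1, &abs_rank, win_grp, &win_rank);
        MPI_Group_free(&world_grp);
        MPI_Group_free(&win_grp);
        return win_rank;      /* MPI_UNDEFINED if abs_rank is not in the group */
    }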

9 GMR: Shared Segment Allocation
base[nproc] = ARMCI_Malloc(size, group)
–Each rank can pass a different size
–Size = 0 yields base[me] = NULL
–base[me] must be valid on the node
–Solution: Allgather the base pointers
ARMCI_Free(ptr)
–Need to find the allocation and its group
–Local process may pass NULL
–Solution: Allreduce to select a leader; the leader broadcasts their base pointer; all processes can then look up the allocation
[Figure: table of per-allocation base pointers indexed by absolute process id]
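
A sketch of the allocation path described above, assuming MPI_Alloc_mem, MPI_Win_create, and an Allgather of base pointers; the function name is hypothetical and error handling is omitted.

    #include <mpi.h>
    #include <stdlib.h>

    void **gmr_malloc(MPI_Aint size, MPI_Comm comm, MPI_Win *win)
    {
        int nproc;
        void *my_base = NULL;
        MPI_Comm_size(comm, &nproc);

        /* Each rank may pass a different size; size == 0 yields a NULL base. */
        if (size > 0) MPI_Alloc_mem(size, MPI_INFO_NULL, &my_base);
        MPI_Win_create(my_base, size, 1, MPI_INFO_NULL, comm, win);

        /* Exchange base pointers so base[rank] is known on every process. */
        void **base = malloc(nproc * sizeof(void *));
        MPI_Allgather(&my_base, sizeof(void *), MPI_BYTE,
                      base,     sizeof(void *), MPI_BYTE, comm);
        return base;
    }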

10 GMR: Preserving MPI Local Access Semantics
Example: ARMCI_Put(src = 0x9e0, dst = 0x39b, size = 8 bytes, rank = 1)
Problem: the local (source) buffer may itself be shared
–Can't access it without an epoch
–Can we lock it? Same window: not allowed; different window: can deadlock
Solution: copy the source to a private buffer (see the sketch below)
–src_copy = Lock; memcpy; Unlock
–xrank = GMR_Translate(comm, rank)
–Lock(win, xrank)
–Put(src_copy, dst, size, xrank)
–Unlock(win, xrank)
[Figure: per-allocation base-pointer table indexed by absolute process id, plus allocation metadata: window = win; size = {1024, …}; grp = comm; …]
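
A minimal sketch of the copy-to-private-buffer workaround above, with window handles, ranks, and displacements passed in directly instead of being looked up through the GMR metadata.

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    void put_from_shared(void *shared_src, MPI_Win src_win, int my_win_rank,
                         MPI_Aint dst_disp, MPI_Win dst_win, int dst_win_rank,
                         int size)
    {
        /* 1. Copy the shared source into a private buffer inside its own
         *    exclusive epoch (MPI-2 forbids unsynchronized local access).  */
        void *src_copy = malloc(size);
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, my_win_rank, 0, src_win);
        memcpy(src_copy, shared_src, size);
        MPI_Win_unlock(my_win_rank, src_win);

        /* 2. Put from the private copy into the destination segment. */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, dst_win_rank, 0, dst_win);
        MPI_Put(src_copy, size, MPI_BYTE, dst_win_rank, dst_disp,
                size, MPI_BYTE, dst_win);
        MPI_Win_unlock(dst_win_rank, dst_win);

        free(src_copy);
    }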

11 ARMCI Noncontiguous Operations: I/O Vector
Generalized noncontiguous transfer with uniform segment size:

    typedef struct {
        void **src_ptr_array;   // Source addresses
        void **dst_ptr_array;   // Destination addresses
        int    bytes;           // Length of every segment
        int    ptr_array_len;   // Number of segments
    } armci_giov_t;

Three methods to support this in MPI (a sketch of the direct method follows the list)
1. Conservative (one operation per epoch): Lock, Put/Get/Acc, Unlock, …
2. Batched (multiple operations per epoch): Lock, Put/Get/Acc, …, Unlock
3. Direct: generate MPI indexed datatypes for the source and destination
–Single operation/epoch: Lock, Put/Get/Acc, Unlock
–Hands processing off to MPI
[Figure: ARMCI_GetV(…) gathering scattered segments from a remote process]
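
A hedged sketch of the direct method (option 3): build hindexed datatypes over the source and destination addresses and issue a single put in a single epoch. The function name, the dst_base parameter (the target segment's base address, known from the allocation metadata), and the byte-granularity handling are assumptions of this sketch.

    #include <mpi.h>
    #include <stdlib.h>

    typedef struct {                /* same layout as shown above */
        void **src_ptr_array;
        void **dst_ptr_array;
        int    bytes;
        int    ptr_array_len;
    } armci_giov_t;

    void iov_put_direct(armci_giov_t *iov, void *dst_base, int target, MPI_Win win)
    {
        int n = iov->ptr_array_len;
        int      *lens  = malloc(n * sizeof(int));
        MPI_Aint *sdisp = malloc(n * sizeof(MPI_Aint));
        MPI_Aint *ddisp = malloc(n * sizeof(MPI_Aint));

        for (int i = 0; i < n; i++) {
            lens[i] = iov->bytes;
            MPI_Get_address(iov->src_ptr_array[i], &sdisp[i]);           /* absolute local address  */
            ddisp[i] = (char *)iov->dst_ptr_array[i] - (char *)dst_base; /* offset in target window */
        }

        MPI_Datatype src_type, dst_type;
        MPI_Type_create_hindexed(n, lens, sdisp, MPI_BYTE, &src_type);
        MPI_Type_create_hindexed(n, lens, ddisp, MPI_BYTE, &dst_type);
        MPI_Type_commit(&src_type);
        MPI_Type_commit(&dst_type);

        /* One operation, one epoch: MPI handles the noncontiguous layout. */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
        MPI_Put(MPI_BOTTOM, 1, src_type, target, 0, 1, dst_type, win);
        MPI_Win_unlock(target, win);

        MPI_Type_free(&src_type);
        MPI_Type_free(&dst_type);
        free(lens); free(sdisp); free(ddisp);
    }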

12 ARMCI Noncontiguous Operations: Strided
Transfer a section of an N-d array into an N-d buffer
Transfer options:
–Translate into an IOV
–Generate datatypes
MPI subarray datatype (see the sketch below):
–dim = sl + 1 (sl = number of stride levels)
–Dims[0] = count[0]
–Dims[1..dim-2] = stride[i] / Dims[i-1]
–Dims[dim-1] = count[dim-1]
–Index = { 0, 0, 0 }
–Sub_dims = count
[Figure: ARMCI strided specification]
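
A hedged sketch of the datatype construction: it follows the spirit of the recipe above but uses C (row-major) ordering and byte units, and assumes each stride is a multiple of the one below it. It is not the actual ARMCI-MPI conversion routine.

    #include <mpi.h>

    void strided_to_subarray(int stride_arr[], int count[], int stride_levels,
                             MPI_Datatype *newtype)
    {
        int dim = stride_levels + 1;
        int sizes[16], subsizes[16], starts[16];     /* assume dim <= 16 */

        /* Fastest (last) dimension: the contiguous bytes of each segment. */
        sizes[dim - 1]    = (stride_levels > 0) ? stride_arr[0] : count[0];
        subsizes[dim - 1] = count[0];
        starts[dim - 1]   = 0;

        /* Middle dimensions: recover the full extent from adjacent strides. */
        for (int i = 1; i < stride_levels; i++) {
            sizes[dim - 1 - i]    = stride_arr[i] / stride_arr[i - 1];
            subsizes[dim - 1 - i] = count[i];
            starts[dim - 1 - i]   = 0;
        }

        /* Slowest dimension: nothing lies beyond it, so size == subsize. */
        if (stride_levels > 0) {
            sizes[0] = subsizes[0] = count[stride_levels];
            starts[0] = 0;
        }

        MPI_Type_create_subarray(dim, sizes, subsizes, starts,
                                 MPI_ORDER_C, MPI_BYTE, newtype);
        MPI_Type_commit(newtype);
    }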

13 Avoiding Concurrent, Conflicting Accesses
Contiguous operations
–Don't know what other nodes do
–Wrap each call in an exclusive epoch
Noncontiguous operations
–I/O vector segments may overlap
–MPI error!
–Must detect conflicts and fall back to conservative mode if needed
Generate a conflict tree (sorted, self-balancing AVL tree); see the sketch below
1. Search the tree for a matching range
2. If a match is found: conflict!
3. Else: insert the range into the tree
Merge the search and insert steps into a single traversal
[Figure: AVL tree of address ranges, e.g. 0x3f0–0x517, 0x1a6–0x200, 0x8bb–0xa02, 0x518–0x812, 0xf37–0xf47, 0x917–0xb10]
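
The check itself reduces to an ordered search with insert-on-miss. The real implementation uses a self-balancing AVL tree; the sorted list below is an illustrative stand-in that shows the same merged search/insert traversal.

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct ival {
        uintptr_t lo, hi;             /* inclusive byte range of one segment */
        struct ival *next;            /* kept sorted by lo                   */
    } ival_t;

    /* Returns 1 on conflict (overlap with an existing range), 0 after inserting. */
    int check_and_insert(ival_t **head, uintptr_t lo, uintptr_t hi)
    {
        ival_t **p = head;
        while (*p != NULL && (*p)->hi < lo)    /* skip ranges entirely below [lo,hi] */
            p = &(*p)->next;

        if (*p != NULL && (*p)->lo <= hi)      /* first remaining range overlaps     */
            return 1;

        ival_t *n = malloc(sizeof *n);         /* no overlap: insert in sorted order */
        n->lo = lo;  n->hi = hi;  n->next = *p;
        *p = n;
        return 0;
    }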

14 Additional ARMCI Operations
1. Collectives: MPI user-defined collectives
2. Mutexes
–MPI-2 RMA doesn't provide any atomic read-modify-write (RMW) operations; read, modify, write is forbidden within an epoch
–Mutex implementation: Latham, Ross, & Thakur [IJHPCA '07]; does not poll over the network, space scales as O(P)
3. Atomic swap, fetch-and-add (see the sketch below)
–Atomic w.r.t. other atomics
–Attach an RMW mutex to each GMR: Mutex_lock, get, modify, put, Mutex_unlock
–Slow, but the best we can do in MPI-2
4. Non-blocking operations are blocking
5. Noncollective processor groups [EuroMPI '11]
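
A hedged sketch of the mutex-protected read-modify-write in item 3. The get and the put sit in separate epochs because reading and then writing the same location within one MPI-2 epoch is forbidden; rmw_mutex_lock/rmw_mutex_unlock are hypothetical stand-ins for the RMW mutex attached to the GMR.

    #include <mpi.h>

    extern void rmw_mutex_lock(int target);    /* hypothetical GMR mutex helpers */
    extern void rmw_mutex_unlock(int target);

    long fetch_and_add_sketch(MPI_Win win, int target, MPI_Aint disp, long inc)
    {
        long old, updated;

        rmw_mutex_lock(target);                /* serialize against other atomics */

        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
        MPI_Get(&old, 1, MPI_LONG, target, disp, 1, MPI_LONG, win);
        MPI_Win_unlock(target, win);           /* the get completes at unlock     */

        updated = old + inc;                   /* modify locally                  */

        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
        MPI_Put(&updated, 1, MPI_LONG, target, disp, 1, MPI_LONG, win);
        MPI_Win_unlock(target, win);

        rmw_mutex_unlock(target);
        return old;
    }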

15 Experimental Setup

    System               Nodes   Cores   Memory  Interconnect  MPI
    IBM BG/P (Intrepid)  40,960  1 x 4   2 GB    3D Torus      IBM MPI
    IB (Fusion)          320     2 x 4   36 GB   IB QDR        MVAPICH2 1.6
    Cray XT5 (Jaguar)    18,688  2 x 6   16 GB   Seastar 2+    Cray MPI
    Cray XE6 (Hopper)    6,392   2 x 12  32 GB   Gemini        Cray MPI

Communication benchmarks
–Contiguous bandwidth
–Noncontiguous bandwidth
NWChem performance evaluation
–CCSD(T) calculation on the water pentamer
IB, XT5: native is much better than ARMCI-MPI (needs tuning)

16 Impact of Interoperability
ARMCI: loses performance with an MPI buffer (unregistered path)
MPI: loses performance with an ARMCI buffer (on-demand registration)
[Figure: IB cluster, ARMCI and MPI Get bandwidth]

17 Contiguous Communication Bandwidth (BG/P and XE6)
BG/P: native is better for small to medium message sizes
–Bandwidth regime: get/put are the same, and accumulate has ~15% less bandwidth
XE6: ARMCI-MPI is 2x better for get/put
–Double-precision accumulate: 2x better for small transfers, the same for large transfers

18 Strided Communication Bandwidth (BG/P)
Segment size: 1 kB
Batched is best
Other methods always pack
–Packing on the host CPU is slower than injecting into the network
–The MPI implementation should select this automatically
Performance is close to native

19 Strided Communication Benchmark (XE6)
Segment size: 1 kB
Batched is best for accumulate; not clear for the others
Significant performance advantage over the current native implementation
–The native implementation is under active development

20 NWChem Performance (BG/P)
[Figure: NWChem CCSD(T) performance on BG/P]

21 NWChem Performance (XE6)
[Figure: NWChem CCSD(T) performance on XE6]

22 Looking Forward to MPI-3 RMA
Unified memory model
–Takes advantage of coherent hardware
–Relaxed synchronization will yield better performance
Conflicting accesses
–Localized to the locations accessed
–Relaxed to undefined results
–Load/store does not corrupt the window
Atomic CAS, fetch-and-add, and new accumulate operations (see the sketch below)
–Mutex space overhead in MPI-3: O(1)
Request-based non-blocking operations
Shared memory (IPC) windows
[Figure: MPI-2 separate public/private window copies vs. the MPI-3 unified copy]
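
For reference, a sketch of the MPI-3 calls that replace the MPI-2 emulation above; the window, target rank, and displacement are placeholders.

    #include <mpi.h>

    void mpi3_atomics(MPI_Win win, int target, MPI_Aint disp)
    {
        long inc = 1, old;

        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);

        /* Atomic fetch-and-add on the target location. */
        MPI_Fetch_and_op(&inc, &old, MPI_LONG, target, disp, MPI_SUM, win);
        MPI_Win_flush(target, win);            /* make `old` valid before reusing it */

        /* Atomic compare-and-swap: write 42 if the location still holds `old`. */
        long compare = old, swap = 42, result;
        MPI_Compare_and_swap(&swap, &compare, &result, MPI_LONG, target, disp, win);

        MPI_Win_unlock(target, win);
    }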

23 Conclusions
ARMCI-MPI provides:
–A complete, portable runtime system for GA and NWChem
–An MPI-2 performance driver
–An MPI-3 feature driver
Mechanisms to overcome the interface and semantic mismatch
Performance is good, but dependent on implementation tuning
–Production use on BG/P and BG/Q
–MPI-2 won't match native performance because of the separate memory model
–MPI-3 is designed to close the memory model gap
Available for download with MPICH2
Integration with ARMCI/GA is in progress
Contact: Jim Dinan

24 Additional Slides

25 Mutex Lock Algorithm
The mutex is a byte vector on rank p: V[nproc] = { 0 }
Mutex_lock (a C sketch follows):
1. MPI_Win_lock(mutex_win, p, EXCL)
2. Put(V[me], 1)
3. Get(V[0..me-1, me+1..nproc-1])
4. MPI_Win_unlock(mutex_win, p, EXCL)
5. If (V[0..me-1, me+1..nproc-1] == 0): return SUCCESS
6. Else: I am enqueued for the mutex; Recv(NOTIFICATION)
[Figure: remote byte vector with my byte set to 1; local copy of the other bytes, all zero]
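
A hedged C rendering of the lock step above; the communicator, message tag, and temporary buffer handling are illustrative assumptions.

    #include <mpi.h>
    #include <stdlib.h>

    void mutex_lock_sketch(MPI_Win mutex_win, int p, int me, int nproc)
    {
        unsigned char one = 1;
        unsigned char *v = malloc(nproc);

        /* Set my byte and read everyone else's in one exclusive epoch. */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, p, 0, mutex_win);
        MPI_Put(&one, 1, MPI_BYTE, p, me, 1, MPI_BYTE, mutex_win);
        if (me > 0)
            MPI_Get(v, me, MPI_BYTE, p, 0, me, MPI_BYTE, mutex_win);
        if (me < nproc - 1)
            MPI_Get(v + me + 1, nproc - me - 1, MPI_BYTE, p, me + 1,
                    nproc - me - 1, MPI_BYTE, mutex_win);
        MPI_Win_unlock(p, mutex_win);

        int held = 0;
        for (int i = 0; i < nproc; i++)
            if (i != me && v[i]) held = 1;
        free(v);

        if (held)   /* someone else holds the mutex: wait for a notification */
            MPI_Recv(NULL, 0, MPI_BYTE, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }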

26 Mutex Unlock Algorithm
The mutex is a byte vector on rank p: V[nproc]
Mutex_unlock:
1. MPI_Win_lock(mutex_win, p, EXCL)
2. Put(V[me], 0)
3. Get(V[0..me-1, me+1..nproc-1])
4. MPI_Win_unlock(mutex_win, p, EXCL)
5. For i = 1..nproc: if (V[(me+i)%nproc] == 1): Send(NOTIFICATION, (me+i)%nproc); break
[Figure: remote byte vector with my byte cleared; local copy showing waiting processes; the next waiter is notified]

27 Additional Data

28 Contiguous Microbenchmark (BG/P and XE6)
BG/P: native is better for small to medium message sizes
–Bandwidth regime: get/put are the same, and accumulate has ~15% less bandwidth
XE6: ARMCI-MPI is 2x better for get/put
–Similar for double-precision accumulate

29 Strided Communication Bandwidth (BG/P)

30 Strided Communication Benchmark (XE6)

31 Contiguous Microbenchmark (IB and XT5)
IB: close to native for get/put in the bandwidth regime
–Performance is a third of native for accumulate
XT5: close to native for moderately sized messages
–Performance is half of native in the bandwidth regime

32 Strided Bandwidth (IB)
Direct is the best option in most cases
IOV-batched is better for large-message accumulate
Tuning is needed!

33 Strided Bandwidth (XT5)
Direct is best
–It should be!
Tuning is needed to better handle large messages and noncontiguous transfers

34 NWChem Performance (XT5)
[Figure: NWChem CCSD(T) performance on XT5]

35 NWChem Performance (IB)
[Figure: NWChem CCSD(T) performance on IB]

