
1 Enhancing Performance Portability of MPI Applications Through Annotation-Based Transformations
Md. Ziaul Haque, Qing Yi, James Dinan, and Pavan Balaji
ICPP, Oct 2013, Lyon, France

2 Motivation
MPI provides a wide variety of communication operations
 – One-sided vs. two-sided
 – Synchronous vs. asynchronous
 – Collective vs. individual sends/recvs
The performance of these operations is sensitive to
 – Their context of use within applications
 – Hardware support for inter-node communication
 – The underlying MPI library and system capabilities
Optimizations within MPI libraries are insufficient
 – Libraries cannot see the context of the operations and thus cannot optimize beyond a single operation
[Figure: communication among Node i, Node j, and Node k]

3 Enhancing Performance Portability of MPI Applications
Applications must parameterize communications to
 – Send messages of the right sizes
 – Overlap communication with computation
 – Use the appropriate communication operations
so that the knobs can be automatically tuned at or before runtime (a sketch of one such knob follows below)
 – Here we consider compilation time
Use annotations to allow explicit parameterization of implementation algorithms
 – Programmable control of optimizations
 – Integration of domain knowledge
 – Fine-grained parameterization of transformations
 – Automated tuning for performance portability
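For instance, a minimal C sketch of such a knob (not from the paper): CHUNK_ELEMS is a hypothetical parameter that a tuner could set per platform to control message size.

    #include <mpi.h>

    #ifndef CHUNK_ELEMS
    #define CHUNK_ELEMS 4096   /* hypothetical tunable: elements per message */
    #endif

    /* Send n doubles to dest in CHUNK_ELEMS-sized pieces, so the message
       size becomes a knob a tuner can adjust per platform. */
    static void send_chunked(const double *buf, int n, int dest, MPI_Comm comm)
    {
        for (int off = 0; off < n; off += CHUNK_ELEMS) {
            int len = (n - off < CHUNK_ELEMS) ? (n - off) : CHUNK_ELEMS;
            MPI_Send(buf + off, len, MPI_DOUBLE, dest, /*tag=*/0, comm);
        }
    }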

4 Outline
Annotation-driven transformation framework
 – Lightweight program transformations using the POET program transformation language
 – Optimizing MPI applications for performance portability
Optimizing the use of MPI libraries
 – The annotation language
 – Automating program transformations
    Coalescing of MPI one-sided communications
    Overlapping communication with computation
    Selecting the appropriate MPI operations
 – Experimental results
Conclusion and future research

5 Optimizing MPI Applications
[Workflow diagram, components as labeled on the slide: Developer and Annotated code; System properties feeding Platform Analysis; Performance Measurements feeding the Optimization Analyzer, which produces a Transformation configuration; Program Transformation producing Modified source code; a Vendor Compiler (e.g. icc/gcc) producing the Executable]

6 Implemented Using the POET Language
A scripting language for
 – Applying parameterized program transformations, interpreted by the search engine and transformation engine
 – Programmable control of compiler optimizations
 – Ad-hoc translation between arbitrary languages
Under development since 2006
 – Open source (BSD license)
 – Language documentation and download available at www.cs.uccs.edu/~qyi/poet

7 The Annotation Language
Recognizes only annotated statement blocks of the form
    #pragma mpi @pragma@ stmt
where stmt is a statement of the underlying language and each @pragma@ is one of the following annotations:
 – osc_coalesce (@win_buf_spec@) …… [nooverlap]
 – cco @mpi_comm@(arg1, …, argm) ……
 – rma (@win_buf_spec@) ……
 – local_ldst (@win_buf_spec@) …… [nooverlap]
 – indep @mpi_comm@(arg1, …, argm) ……
Each transformation is driven by a pragma (see the example below)
 – Future work will seek to automatically generate pragmas via program analysis
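For example, a hypothetical use of the osc_coalesce annotation, written as a fragment with full MPI signatures rather than the slides' abbreviated ones; win, buf, target, and disp are illustrative names declared elsewhere:

    /* Sketch: the #pragma mpi line governs the statement block that follows. */
    #pragma mpi osc_coalesce(win) nooverlap
    {
        MPI_Win_fence(0, win);
        MPI_Put(buf,     1, MPI_INT, target, disp,     1, MPI_INT, win);
        MPI_Put(buf + 1, 1, MPI_INT, target, disp + 1, 1, MPI_INT, win);
        MPI_Win_fence(0, win);
    }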

8 Annotation-driven Optimization Algorithm
input: the MPI program to optimize; config: architecture configuration of the system

    foreach annotated MPI block (annot, body) in input:
        (1) if is_data_coalesce_annot(annot):
                foreach win ∈ win_buf_list_of(annot):
                    mpi_osc_data_coalesce(win, has_overlap(annot), body)
        (2) if is_cco_annot(annot):
                foreach comm ∈ comm_groups_of(annot):
                    mpi_comp_comm_overlap(comm, innermost_body_of(body))
        (3) if is_rma_annot(annot):
                foreach win ∈ win_buf_list_of(annot):
                    if cache_coh(config): mpi_rma_2_ldst(win, body)
        (4) if is_ldst_annot(annot):
                if cache_coh(config): mpi_ldst_coh(win, has_overlap(annot), body)
                else: mpi_ldst_incoh(win, has_overlap(annot), body)

9 Coalescing of One-sided Communications
Group communications to the same destination; postpone communication until a dedicated buffer for the group is full
The actual transformation generates complex code to accommodate
 – Dynamic coalescing of messages in loops
 – Parameterization of message buffer sizes

Original code with pragma:
    #pragma mpi osc_coalesce (win) nooverlap
    {
        MPI_Win_fence(win);
        MPI_Accumulate(x[0], target, win);
        MPI_Accumulate(x[1], target, win);
        foo();
        MPI_Put(y[0], target1, win);
        MPI_Put(y[1], target2, win);
        MPI_Put(y[2], target1, win);
        MPI_Win_fence(win);
    }

Optimized pseudo code:
    MPI_Win_fence(win);
    MPI_Accumulate(x[0,1], target, win);
    foo();
    MPI_Put(y[0,2], target1, win);
    MPI_Put(y[1], target2, win);
    MPI_Win_fence(win);
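Spelled out with full MPI_Put signatures, a hand-written sketch of the effect (not the generated code), assuming for simplicity that the two puts to target1 land at adjacent displacements; disp, target1, win, and y are illustrative:

    /* Before: two puts of single ints to the same target. */
    MPI_Put(&y[0], 1, MPI_INT, target1, disp,     1, MPI_INT, win);
    MPI_Put(&y[2], 1, MPI_INT, target1, disp + 1, 1, MPI_INT, win);

    /* After: stage both values in a dedicated coalescing buffer and issue
       one put. The buffer must stay intact until the closing fence. */
    int coalesce_buf[2] = { y[0], y[2] };
    MPI_Put(coalesce_buf, 2, MPI_INT, target1, disp, 2, MPI_INT, win);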

10 Communication Coalescing: Key Strategies
Grouping of MPI communications
 – Members have the same destination and use the same MPI_Put/Get or the same reduction operation in MPI_Accumulate
 – Allocate a dedicated buffer for each group
Postpone communications until a coalescing buffer is full (constrained by a preset CL_factor)
 – Use AVL trees to resolve conflicting addresses of MPI_Accumulate, unless a "nooverlap" clause is given by the user annotation
Clear all buffers at the final synchronization
 – Free coalescing buffers for reuse
Handle unknown function calls
 – Treat them as potential synchronizations that trigger clearing of the coalescing buffers, unless they are annotated as safe statements by the user
(A sketch of the buffer-and-flush strategy follows below.)
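A minimal C sketch of the buffer-and-flush strategy for one group, with illustrative names throughout; it assumes an MPI-3 passive-target epoch so that MPI_Win_flush can recycle the staging buffer, whereas the generated code is more general:

    #include <mpi.h>

    #define CL_FACTOR 1024          /* illustrative capacity (the preset CL_factor) */

    /* One coalescing group: staged puts of ints to a single destination. */
    typedef struct {
        int      data[CL_FACTOR];   /* dedicated buffer for this group */
        int      count;             /* elements currently staged */
        int      target;            /* destination rank shared by the group */
        MPI_Aint disp;              /* next displacement at the target */
        MPI_Win  win;
    } coalesce_group;

    /* Flush staged elements as one MPI_Put; MPI_Win_flush completes the
       put locally so the staging buffer can be reused. */
    static void group_flush(coalesce_group *g)
    {
        if (g->count == 0) return;
        MPI_Put(g->data, g->count, MPI_INT, g->target,
                g->disp, g->count, MPI_INT, g->win);
        MPI_Win_flush(g->target, g->win);
        g->disp += g->count;        /* this sketch assumes an append-only stream */
        g->count = 0;
    }

    /* Postpone a put by staging its value; flush first if the buffer is full. */
    static void group_put(coalesce_group *g, int value)
    {
        if (g->count == CL_FACTOR)
            group_flush(g);
        g->data[g->count++] = value;
    }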

11 Overlapping Communication With Computation
Split synchronous operations into asynchronous ones and waits
 – Move asynchronous operations up as early as possible
 – Move wait operations as late as possible
 – Use the "indep" annotation to indicate independence of computation/communication
Ongoing extension: breaking up communications into smaller messages before overlapping them with computation

Original code with pragma (partially garbled in the transcript; lost parts kept as …):
    #pragma mpi cco MPI_SendRecv(ew_comm, ns_comm)
    for (i = 0; i < …; i++) {
        if (ns_id > 0) MPI_Send(…, ns_comm);
        if (ns_id < …) …
        …
    }

Optimized pseudo code:
    if (ns_id > 0) MPI_Isend(…, ns_comm, &r1);
    if (ns_id < …) …
    …
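Written out by hand as a self-contained C sketch of the same split-and-sink pattern; the halo-exchange names are illustrative, not recovered from the garbled slide:

    #include <mpi.h>

    extern void compute_interior(void);   /* work independent of the halos */
    extern void compute_boundary(void);   /* work that needs the received halos */

    void exchange_and_compute(double *send_n, double *recv_n,
                              double *send_s, double *recv_s, int count,
                              int north, int south, int ns_id, int ns_procs,
                              MPI_Comm ns_comm)
    {
        MPI_Request reqs[4];
        int nreq = 0;

        /* Asynchronous operations moved up as early as possible. */
        if (ns_id > 0) {
            MPI_Isend(send_n, count, MPI_DOUBLE, north, 0, ns_comm, &reqs[nreq++]);
            MPI_Irecv(recv_n, count, MPI_DOUBLE, north, 0, ns_comm, &reqs[nreq++]);
        }
        if (ns_id < ns_procs - 1) {
            MPI_Isend(send_s, count, MPI_DOUBLE, south, 0, ns_comm, &reqs[nreq++]);
            MPI_Irecv(recv_s, count, MPI_DOUBLE, south, 0, ns_comm, &reqs[nreq++]);
        }

        /* Independent computation (the "indep" case) overlaps the transfers. */
        compute_interior();

        /* Waits sunk as late as possible. */
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
        compute_boundary();
    }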

12 Remote Memory Accesses vs. Local Loads/Stores
Performance penalties of mixing RMA and local loads/stores
 – Exclusive locks are required when using local loads/stores, which are faster when the hardware supports cache coherence
 – Locking is unnecessary when the hardware supports cache coherence
Optimization: automatically select the best operations based on the underlying system support of the hardware platform

Using remote memory accesses:
    #pragma mpi rma(win, buf, int, MPI_INT, wsize, wrank)
    {
        MPI_Win_lock(MPI_LOCK_SHARED, i, 0, win);
        for (j = 0; j < BUF_PER_PROC; j++) {
            MPI_Put(&wrank, 1, MPI_INT, i, base+j, 1, MPI_INT, win);
        }
        MPI_Win_unlock(i, win);
    }

Using local loads/stores:
    #pragma mpi local_ldst(win, buf, int, MPI_INT, wsize, wrank) nooverlap
    {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, i, 0, win);
        for (j = 0; j < BUF_PER_PROC; j++) {
            buf[base+j] = wrank;
        }
        MPI_Win_unlock(i, win);
    }
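One portable way to probe for cache coherence at run time is to query the window's MPI-3 memory model; this is a sketch of such a check, not necessarily how the framework's platform analysis works:

    #include <mpi.h>

    /* Returns 1 if the window uses the MPI-3 unified (cache-coherent)
       memory model, 0 otherwise. */
    static int win_is_unified(MPI_Win win)
    {
        int *model, flag = 0;
        MPI_Win_get_attr(win, MPI_WIN_MODEL, &model, &flag);
        return flag && *model == MPI_WIN_UNIFIED;
    }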

13 Experimental Results
Goal: study the performance portability of MPI applications
Using four benchmarks, with FT manually transformed
Using two supercomputers from DOE/ANL
 – Fusion: a cluster with 320 nodes, each with two quad-core 2.6 GHz Intel Nehalem processors and 36 GB of memory, interconnected via InfiniBand
 – Surveyor: a Blue Gene/P system with 1024 compute nodes, each with a quad-core 850 MHz PowerPC 450 processor and 2 GB of memory

    Name      Benchmark  Description                                  Transformation applied
    bfs       Graph500   Breadth-first search of an undirected graph  OSC coalesce
    rma-ldst  Synthetic  Random communications using MPI_Put          RMA vs. local ld/st translation
    stencil   Synthetic  3D stencil using MPI send/recv               Comm/comp overlapping
    FT        NAS        3D PDE using MPI all-to-all                  Collective vs. one-sided comm

14 Result: Applying osc_coalesce to bfs on Fusion (using 128 nodes)

15 Result: Applying cco to stencil on Surveyor

16 Result: Optimizing NAS FT on Fusion (top) and on Surveyor (bottom)

17 Conclusions
Most MPI optimizations are platform-sensitive; it is difficult to determine a priori
 – The best message size to send/receive
 – Which communication operation to use
 – How much memory to use to coalesce messages
Automating the optimizations
 – Requires parameterizing optimization configurations and specializing applications for each individual platform
 – Requires allowing developers to provide hints and help via annotation-driven program analysis & transformation
Future work
 – Apply optimizations across procedure boundaries
 – Automatically determine opportunities and generate annotations

