1 A Multi-platform Co-Array Fortran Compiler
Yuri Dotsenko, Cristian Coarfa, John Mellor-Crummey
Department of Computer Science, Rice University, Houston, TX, USA

2 Motivation
Parallel Programming Models
MPI: de facto standard
–difficult to program
OpenMP: inefficient to map onto distributed-memory platforms
–lack of locality control
HPF: hard to obtain high performance
–heroic compilers needed!
Global address space languages (CAF, Titanium, UPC): an appealing middle ground

3 Co-Array Fortran
Global address space programming model
–one-sided communication (GET/PUT)
Programmer has control over performance-critical factors
–data distribution
–computation partitioning
–communication placement
Data movement and synchronization as language primitives
–amenable to compiler-based communication optimization

4 CAF Programming Model Features
SPMD process images
–fixed number of images during execution
–images operate asynchronously
Both private and shared data
–real x(20,20) – a private 20x20 array in each image
–real y(20,20)[*] – a shared 20x20 array in each image
Simple one-sided shared-memory communication
–x(:,j:j+2) = y(:,p:p+2)[r] – copy columns from image r into local columns
Synchronization intrinsic functions
–sync_all – a barrier and a memory fence
–sync_mem – a memory fence
–sync_team([team members to notify], [team members to wait for])
Pointers and (perhaps asymmetric) dynamic allocation
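To make these features concrete, here is a minimal sketch of a CAF program (not from the talk; the program and variable names are illustrative) that declares private and shared data, performs a one-sided GET from a neighboring image, and synchronizes with sync_all:

    program exchange
      real :: x(20,20)                 ! private: a separate copy in each image
      real :: y(20,20)[*]              ! co-array: remotely addressable from any image
      integer :: me, right

      me = this_image()
      right = merge(1, me + 1, me == num_images())   ! right neighbor, wrapping around

      y = real(me)                     ! initialize the local part of the co-array
      call sync_all()                  ! barrier + memory fence before any remote reads

      x(:, 1:3) = y(:, 1:3)[right]     ! one-sided GET: three columns from image right

      call sync_all()
    end program exchange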

5 One-sided Communication with Co-Arrays
integer a(10,20)[*]
if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1]
[Figure: the co-array a(10,20) on images 1, 2, …, N, before and after the copy]
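For comparison, the same exchange can be expressed as a one-sided PUT that pushes the data to the right neighbor instead of pulling it from the left; this variant is a sketch and does not appear on the slide:

    if (this_image() < num_images()) a(1:10,1:2)[this_image()+1] = a(1:10,19:20)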

6 Rice Co-Array Fortran Compiler (cafc)
First multi-platform CAF compiler
–previous compiler only for Cray shared-memory systems
Implements the core of the language
–currently lacks support for derived-type and dynamic co-arrays
Core sufficient for non-trivial codes
Performance comparable to that of hand-tuned MPI codes
Open source

7 Outline
CAF programming model
cafc
→ Core language implementation
–Optimizations
Experimental evaluation
Conclusions

8 Implementation Strategy
Source-to-source compilation of CAF codes
–uses the Open64/SL Fortran 90 infrastructure
–CAF → Fortran 90 + communication operations
Communication
–ARMCI library for one-sided communication on clusters
–load/store communication on shared-memory platforms
Goals
–portability
–high performance on a wide range of platforms

9 Co-Array Descriptors
Initialize and manipulate Fortran 90 dope vectors

real :: a(10,10,10)[*]

type CAFDesc_real_3
  integer(ptrkind) :: handle   ! opaque handle to the CAF runtime representation
  real, pointer :: ptr(:,:,:)  ! Fortran 90 pointer to local co-array data
end type CAFDesc_real_3

type(CAFDesc_real_3) :: a

10 Allocating COMMON and SAVE Co-Arrays
Compiler
–generates a static initializer for each COMMON/SAVE co-array variable
Linker
–collects calls to all initializers
–generates a global initializer that calls all the others
–compiles the global initializer and links it into the program
Launch
–invokes the global initializer before the main program begins
  allocates co-array storage outside the Fortran 90 runtime system
  associates co-array descriptors with the allocated memory
Similar to the handling of C++ static constructors

11 Parameter Passing
Call-by-value convention (copy-in, copy-out)
–pass remote co-array data to procedures only as values
Call-by-co-array convention*
–argument declared as a co-array by the callee
–enables access to local and remote co-array data
Call-by-reference convention* (cafc)
–argument declared as an explicit-shape array
–enables access to local co-array data only
–enables reuse of existing Fortran code
* requires an explicit interface

call f(( a(i)[p] ))        ! call-by-value: remote data passed as a value

real :: x(10)[*]
call f(x)

subroutine f(a)
real :: a(10)[*]           ! call-by-co-array: callee declares the dummy as a co-array

subroutine f(a)
real :: a(10)              ! call-by-reference: explicit-shape dummy, local data only

12 Multiple Co-dimensions
Managing processors as a logical multi-dimensional grid
integer a(10,10)[5,4,*]
–a 3D processor grid 5 x 4 x …
Support co-space reshaping at procedure calls
–change the number of co-dimensions
–co-space bounds as procedure arguments
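A minimal sketch (not from the talk) of co-space reshaping at a procedure call, with the co-space bounds passed as arguments; the routine and variable names are illustrative, and an explicit interface for solver is assumed at the call site:

    subroutine driver()
      integer, save :: a(10,10)[5,4,*]   ! caller views the co-space as a 5 x 4 x … grid
      call solver(a, 5, 4)
    end subroutine driver

    subroutine solver(a, np1, np2)
      integer :: np1, np2
      integer :: a(10,10)[np1,np2,*]     ! co-space bounds arrive as procedure arguments
      a(1,1) = this_image()              ! work with the reshaped view locally
    end subroutine solver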

13 Implementing Communication
x(1:n) = a(1:n)[p] + …
Use a temporary buffer to hold off-processor data
–allocate buffer
–perform GET to fill buffer
–perform computation: x(1:n) = buffer(1:n) + …
–deallocate buffer
Optimizations
–no temporary storage for co-array-to-co-array copies
–load/store communication on shared-memory systems
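The buffering pattern, written as a sketch at the source level with illustrative names; in code generated by cafc the GET into the temporary becomes a call into the communication runtime rather than a co-array reference, and an explicit interface is required for the co-array dummy:

    subroutine remote_add(x, a, z, n, p)
      integer :: n, p
      real :: x(n), z(n)
      real :: a(n)[*]
      real, allocatable :: buffer(:)

      allocate(buffer(n))               ! temporary to hold the off-processor data
      buffer(1:n) = a(1:n)[p]           ! GET: fill the buffer from image p
      x(1:n) = buffer(1:n) + z(1:n)     ! the computation uses only local data
      deallocate(buffer)
    end subroutine remote_add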

14 Synchronization
Original CAF specification: team synchronization only
–sync_all, sync_team
–limits performance on loosely-coupled architectures
Point-to-point extensions
–sync_notify(q)
–sync_wait(p)
Point-to-point synchronization semantics
–delivery of a notify to q from p ⇒ all communication from p to q issued before the notify has been delivered to q
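A sketch (illustrative names, not from the talk) of a producer/consumer hand-off using the point-to-point extensions; by the semantics above, the data PUT before sync_notify is visible at image q once sync_wait returns:

    subroutine handoff(a, b, n, p, q)
      integer :: n, p, q
      real :: a(n)
      real :: b(n)[*]

      if (this_image() == p) then
         b(1:n)[q] = a(1:n)        ! PUT the data into image q's co-array
         call sync_notify(q)       ! notify q; delivered only after the PUT
      else if (this_image() == q) then
         call sync_wait(p)         ! wait for p's notify (and hence for its PUT)
         a(1:n) = b(1:n)           ! safe to read the freshly written values
      end if
    end subroutine handoff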

15 Outline
CAF programming model
cafc
–Core language implementation
→ Optimizations
  procedure splitting
  supporting hints for non-blocking communication
  packing strided communications
Experimental evaluation
Conclusions

16 An Impediment to Code Efficiency
Original reference
rhs(1,i,j,k,c) = … + u(1,i-1,j,k,c) - …
Transformed reference
rhs%ptr(1,i,j,k,c) = … + u%ptr(1,i-1,j,k,c) - …
Fortran 90 pointer-based co-array representation does not convey
–the lack of co-array aliasing
–co-array contiguity
–co-array bounds
Lack of knowledge inhibits important code optimizations

17 Procedure Splitting

Before:

subroutine f(…)
  real, save :: c(100)[*]
  ... = c(50) ...
end subroutine f

After (CAF-to-CAF preprocessing):

subroutine f(…)
  real, save :: c(100)[*]
  interface
    subroutine f_inner(…, c_arg)
      real :: c_arg[*]
    end subroutine f_inner
  end interface
  call f_inner(…, c)
end subroutine f

subroutine f_inner(…, c_arg)
  real :: c_arg(100)[*]
  ... = c_arg(50) ...
end subroutine f_inner

18 Benefits of Procedure Splitting
Generated code conveys
–lack of co-array aliasing
–co-array contiguity
–co-array bounds
Enables the back-end compiler to generate better code

19 Hiding Communication Latency
Goal: enable communication/computation overlap
Impediments to generating non-blocking communication
–use of indexed subscripts in co-dimensions
–lack of whole-program analysis
Approach: support hints for non-blocking communication
–overcome conservative compiler analysis
–enable sophisticated programmers to achieve good performance today

20 Hints for Non-blocking PUTs
Hints for the CAF run-time system to issue non-blocking PUTs

region_id = open_nb_put_region()
...
Put_Stmt_1
...
Put_Stmt_N
...
call close_nb_put_region(region_id)

Complete non-blocking PUTs:
call complete_nb_put_region(region_id)

Open problem: exploiting non-blocking GETs?
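A sketch of how these hints might be used around a halo exchange so that independent local work overlaps the PUTs; the array, the neighbor indices, and do_local_work (a stand-in for any independent computation) are illustrative and not from the talk:

    subroutine halo_put(a, m, n, left, right)
      integer :: m, n, left, right, region_id
      real :: a(m,n)[*]

      region_id = open_nb_put_region()
      a(1:m, n)[left]  = a(1:m, 2)           ! PUTs issued inside the region may be
      a(1:m, 1)[right] = a(1:m, n-1)         ! made non-blocking by the run-time system
      call close_nb_put_region(region_id)

      call do_local_work()                   ! hypothetical: computation overlapped with the PUTs

      call complete_nb_put_region(region_id) ! wait here until the PUTs have completed
    end subroutine halo_put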

21 Strided vs. Contiguous Transfers
Problem
–a CAF remote reference might induce many small data transfers
a(i,1:n)[p] = b(j,1:n)
Solution
–pack strided data on the source and unpack it on the destination
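A sketch of what this packing amounts to if coded by hand (the next slide explains why it is better left to the communication library): the strided row is gathered into a contiguous buffer, shipped in a single PUT to a contiguous staging co-array, and unpacked on the destination. The staging co-array and all names are illustrative assumptions:

    subroutine strided_put_packed(b, stage, m, n, j, p)
      integer :: m, n, j, p
      real :: b(m,n)
      real :: stage(n)[*]            ! contiguous staging co-array, one per image
      real :: buf(n)

      buf(1:n) = b(j, 1:n)           ! pack: gather the strided row into contiguous storage
      stage(1:n)[p] = buf(1:n)       ! one contiguous PUT instead of n small transfers
      call sync_notify(p)            ! tell image p that the staged data has arrived

      ! on image p, after a matching sync_wait:
      !   a(i, 1:n) = stage(1:n)     ! unpack into the strided destination row
    end subroutine strided_put_packed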

22 Pragmatics of Packing
Who should implement packing?
The CAF programmer
–difficult to program
The CAF compiler
–unpacking requires conversion of PUTs into two-sided communication (a difficult whole-program transformation)
The communication library
–most natural place
–ARMCI currently performs packing on Myrinet

23 CAF Compiler Targets (Sept 2004)
Processors
–Pentium, Alpha, Itanium2, MIPS
Interconnects
–Quadrics, Myrinet, Gigabit Ethernet, shared memory
Operating systems
–Linux, Tru64, IRIX

24 Outline
CAF programming model
cafc
–Core language implementation
–Optimizations
→ Experimental evaluation
Conclusions

25 Experimental Evaluation
Platforms
–Alpha + Quadrics QSNet (Elan3)
–Itanium2 + Quadrics QSNet II (Elan4)
–Itanium2 + Myrinet 2000
Codes
–NAS Parallel Benchmarks (NPB 2.3) from NASA Ames

26 NAS BT Efficiency (Class C)

27 NAS SP Efficiency (Class C)
–lack of a non-blocking notify implementation blocks CAF comm/comp overlap

28 NAS MG Efficiency (Class C)
–ARMCI communication is efficient
–point-to-point synchronization boosts CAF performance by 30%

29 NAS CG Efficiency (Class C)

30 NAS LU Efficiency (Class C)

31 Impact of Optimizations: Assorted Results
Procedure splitting
–42-60% improvement for BT on Itanium2+Myrinet cluster
–15-33% improvement for LU on Alpha+Quadrics
Non-blocking communication generation
–5% improvement for BT on Itanium2+Quadrics cluster
–3% improvement for MG on all platforms
Packing of strided data
–31% improvement for BT on Alpha+Quadrics cluster
–37% improvement for LU on Itanium2+Quadrics cluster
See paper for more details

32 Conclusions
CAF boosts programming productivity
–simplifies the development of SPMD parallel programs
–shifts details of managing communication to the compiler
cafc delivers performance comparable to hand-tuned MPI
cafc implements effective optimizations
–procedure splitting
–non-blocking communication
–packing of strided communication (in ARMCI)
Vectorization is needed to achieve true performance portability on machines like the Cray X1
http://www.hipersoft.rice.edu/caf

