Presentation transcript: Shared Memory Programming for Large Scale Machines (CS 6091, Michigan Technological University, 3/15/2006)

Slide 1: Shared Memory Programming for Large Scale Machines
C. Barton (1), C. Cascaval (2), G. Almasi (2), Y. Zheng (3), M. Farreras (4), J. Nelson Amaral (1)
(1) University of Alberta, (2) IBM Watson Research Center, (3) Purdue, (4) Universitat Politecnica de Catalunya
IBM Research Report RC23853, January 27, 2006

Slide 2: Abstract
- UPC is scalable and competitive with MPI on hundreds of thousands of processors.
- This paper discusses the compiler and runtime system features that achieve this performance on the IBM BlueGene/L.
- Three benchmarks are used:
  - HPC RandomAccess
  - HPC STREAMS
  - NAS Conjugate Gradient (CG)

Slide 3: 1. BlueGene/L
- 65,536 x 2-way 700 MHz processors (low power)
- 280 sustained Tflops on HPL Linpack
- 64 x 32 x 32 3D packet-switched torus network
- XL UPC compiler and UPC runtime system (RTS)

Slide 4: 2.1 XL Compiler Structure
- UPC source is translated to W-code.
- An early version worked the way MuPC does: calls to the RTS were inserted directly into W-code. This prevents optimizations such as copy propagation and common subexpression elimination.
- The current version delays the insertion of RTS calls. W-code is extended to represent shared variables and the memory access mode (strict or relaxed). (See the sketch below.)
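To make the difference concrete, here is a hedged sketch of how early versus delayed RTS-call insertion affects a repeated shared read; the __xlupc_deref name is an invented placeholder, not the compiler's actual runtime symbol.

    #include <upc_relaxed.h>

    shared int x;          /* relaxed shared scalar */

    int f(void)
    {
        /* Source: the same shared read appears twice.                   */
        int t = x + x;

        /* Early scheme (conceptually): each read becomes an opaque RTS  */
        /* call in W-code, e.g.                                           */
        /*     t = __xlupc_deref(&x) + __xlupc_deref(&x);                 */
        /* so common subexpression elimination cannot merge the reads.    */

        /* Delayed scheme: the optimizer still sees "x + x" as a shared   */
        /* access, keeps a single read, and only afterwards lowers it to  */
        /* one RTS call.                                                   */
        return t;
    }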

Slide 5: XL Compiler (cont'd)
- The Toronto Portable Optimizer (TPO) can "apply all the classical optimizations" to shared memory accesses.
- UPC-specific optimizations are also performed.

Slide 6: 2.2 UPC Runtime System
- The RTS targets:
  - SMPs using Pthreads
  - Ethernet and LAPI clusters using LAPI
  - BlueGene/L using the BlueGene/L message layer
- TPO does link-time optimizations between the user program and the RTS.
- Shared objects are accessed through handles.

Slide 7: Shared objects
- The RTS identifies five shared object types:
  - shared scalars
  - shared structures/unions/enumerations
  - shared arrays
  - shared pointers [sic] with shared targets
  - shared pointers [sic] with private targets
- "Fat" pointers increase remote access costs and limit scalability. (Optimizing remote accesses is discussed soon; a sketch of a fat pointer follows.)
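As a rough illustration of why these pointers are "fat": a typical UPC implementation represents a pointer-to-shared as a (thread, phase, address) triple rather than a single machine word. The struct below is a generic sketch with assumed field names and widths, not the layout used by the IBM RTS.

    /* Generic fat-pointer sketch; field names and widths are assumptions. */
    struct fat_ptr {
        unsigned int thread;   /* UPC thread that owns the target element  */
        unsigned int phase;    /* position within the current block        */
        void        *addr;     /* local address, or an SVD handle remotely */
    };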

Slide 8: Shared Variable Directory (SVD)
- Each thread on a distributed-memory machine holds a two-level SVD of handles that point to all shared objects.
- The SVD in each thread has THREADS+1 partitions.
- Partition i contains handles for shared objects in thread i, except the last partition, which contains handles for statically declared shared arrays.
- Local sections of shared arrays do not have to be mapped to the same address on each thread.
(A sketch of the structure follows.)
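A minimal sketch of the structure just described, with invented type and field names; the paper does not give the RTS data layout at this level.

    #include <stddef.h>

    typedef struct {
        void  *local_addr;        /* meaningful only on the owning thread   */
        size_t size;              /* per-object metadata                    */
    } svd_handle_t;

    typedef struct {
        svd_handle_t *handles;    /* second level: table of handles         */
        size_t        count;
    } svd_partition_t;

    typedef struct {
        /* First level: THREADS + 1 partitions.  Partition i holds handles  */
        /* for objects allocated by thread i; the extra partition holds     */
        /* statically declared shared arrays, indexed identically on every  */
        /* thread.                                                          */
        svd_partition_t *partition;
    } svd_t;

    /* A shared object is named by (partition, index) rather than by a      */
    /* global address; only the owner resolves a handle to a real pointer.  */
    static svd_handle_t *svd_lookup(svd_t *svd, int part, int idx)
    {
        return &svd->partition[part].handles[idx];
    }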

Slide 9: SVD benefits
- Scalability: pointers-to-shared do not have to span all of shared memory. Only the owner knows the addresses of its shared objects. Remote accesses are made via handles.
- Each thread mediates access to its shared objects, so coherence problems are reduced. (Runtime caching is beyond the scope of this paper.)
- Only nonblocking synchronization is needed for upc_global_alloc(), for example.

Slide 10: 2.3 Messaging Library
- This topic is beyond the scope of this talk.
- Note, however, that the network layer does not support one-sided communication.

Slide 11: 3. Compiler Optimizations
- 3.1 upc_forall(init; limit; incr; affinity)
- 3.2 Local memory optimizations
- 3.3 Update optimizations

Slide 12: 3.1 upc_forall
- The affinity parameter may be:
  - a pointer-to-shared
  - an integer type
  - continue
- If the affinity expression is the (unmodified) induction variable, the per-iteration affinity conditional is eliminated.
- This is the only optimization technique used.
- "... even this simple optimization captures most of the loops in the existing UPC benchmarks."
(A sketch of the transformation follows.)
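A sketch of the transformation under the usual reading of this optimization; the arrays and loop are illustrative, not from the paper. With the default cyclic layout, element i of a has affinity to thread i % THREADS, which matches the integer affinity expression i.

    #include <upc_relaxed.h>

    #define N (1024 * THREADS)
    shared double a[N], b[N];

    void scale(void)
    {
        int i;

        /* Source form: affinity is the unmodified induction variable, so  */
        /* iteration i is executed by thread i % THREADS.                  */
        upc_forall (i = 0; i < N; i++; i)
            a[i] = 2.0 * b[i];

        /* Roughly what the compiler generates once the per-iteration      */
        /* affinity test is eliminated: each thread strides over only its  */
        /* own iterations.                                                  */
        for (i = MYTHREAD; i < N; i += THREADS)
            a[i] = 2.0 * b[i];
    }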

Slide 13: Observations
- upc_forall loops cannot be meaningfully nested.
- upc_forall loops must be inner loops for this optimization to pay off.

Slide 14: 3.2 Local Memory Operations
- Try to turn dereferences of fat pointers into dereferences of ordinary C pointers.
- The optimization is attempted only when affinity can be statically determined.
- Move the base address calculation to the loop preheader (initialization block).
- Generate code to access intrinsic types directly; otherwise use memcpy.
(A sketch follows.)
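A hedged sketch of the shared-to-local idea; names are invented, and the cast below is the standard UPC privatization idiom rather than the compiler's internal code generation. It assumes the default cyclic layout, under which each thread's elements are contiguous in its local section.

    #include <upc_relaxed.h>

    #define ELEMS_PER_THREAD 1024
    shared double a[ELEMS_PER_THREAD * THREADS];

    void zero_local_part(void)
    {
        int i;

        /* Base-address calculation hoisted to the loop preheader: a plain  */
        /* C pointer to this thread's local section of the shared array.    */
        double *local_a = (double *)&a[MYTHREAD];

        /* This thread owns elements MYTHREAD, MYTHREAD+THREADS, ...;       */
        /* locally they are contiguous, so the loop uses ordinary           */
        /* dereferences instead of fat-pointer accesses.                    */
        for (i = 0; i < ELEMS_PER_THREAD; i++)
            local_a[i] = 0.0;
    }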

Slide 15: 3.3 Update Optimizations
- Consider operations of the form r = r op B, where r is a remote shared object and B is local or remote.
- Implement this as an active message [Culler, UC Berkeley].
- Send the entire instruction to the thread with affinity to r. (A sketch follows.)
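A hedged sketch of the mechanism; every name below (rts_handle_t, owner_of, local_address, send_active_message, OP_ADD) is a placeholder invented for illustration, since the paper does not expose the RTS or message-layer interface at this level.

    typedef struct rts_handle rts_handle_t;   /* opaque SVD handle           */
    enum { OP_ADD };                           /* example opcode              */

    /* Assumed RTS/message-layer entry points (placeholders, not real API):  */
    int  owner_of(rts_handle_t *h);
    int *local_address(rts_handle_t *h);
    void send_active_message(int thread,
                             void (*handler)(rts_handle_t *, int, int),
                             rts_handle_t *h, int op, int operand);

    /* Owner side: the message layer runs this handler on the thread with    */
    /* affinity to r, so the read-modify-write is purely local.              */
    void update_handler(rts_handle_t *r, int op, int operand)
    {
        int *p = local_address(r);
        if (op == OP_ADD)
            *p += operand;
    }

    /* Caller side: replaces "r = r + b" on a possibly remote shared int r   */
    /* with a single one-way message carrying (handle, opcode, operand)      */
    /* instead of a get followed by a put.                                   */
    void update_add(rts_handle_t *r, int b)
    {
        send_active_message(owner_of(r), update_handler, r, OP_ADD, b);
    }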

Slide 16: 4. Experimental Results
- 4.1 Hardware
- 4.2 HPC RandomAccess benchmark
- 4.3 Embarrassingly Parallel (EP) STREAM triad
- 4.4 NAS CG
- 4.5 Performance evaluation

Slide 17: 4.1 Hardware
- Development was done on 64-processor node cards.
- TJ Watson: 20 racks, 40,960 processors
- LLNL: 64 racks, 131,072 processors

Slide 18: 4.2 HPC RandomAccess
- 111 lines of code
- Read-modify-write randomly selected remote objects.
- Uses 50% of memory.
- [Seems a good match for the update optimization; a sketch of the kernel follows.]
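A stripped-down sketch of the kernel, not the actual 111-line benchmark: the table size and random-number generator are simplified placeholders. It shows why each update is exactly the r = r op B pattern targeted by the update optimization.

    #include <upc_relaxed.h>
    #include <stdint.h>

    #define TABLE_SIZE (1024 * THREADS)   /* illustrative, not 50% of memory */
    #define NUPDATE    (4 * TABLE_SIZE)

    shared uint64_t Table[TABLE_SIZE];

    void random_access(void)
    {
        uint64_t i, ran = MYTHREAD + 1;   /* toy LCG seed, one per thread */

        upc_forall (i = 0; i < TABLE_SIZE; i++; i)
            Table[i] = i;                 /* each thread initializes its own elements */
        upc_barrier;

        for (i = MYTHREAD; i < NUPDATE; i += THREADS) {
            /* A plain LCG stands in for the benchmark's RNG. */
            ran = ran * 6364136223846793005ULL + 1442695040888963407ULL;

            /* Remote read-modify-write of a randomly chosen table element:  */
            /* exactly the r = r op B shape the update optimization targets. */
            Table[ran % TABLE_SIZE] ^= ran;
        }
        upc_barrier;
    }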

Slide 19: 4.3 EP STREAM Triad
- 105 lines of code
- All computation is done locally within a upc_forall loop.
- [Seems like a good match for the loop optimization; a sketch follows.]
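A minimal sketch of the triad kernel, not the benchmark's actual 105 lines; sizes are illustrative. With the default cyclic layout, a[i], b[i], and c[i] all live on thread i % THREADS, so every iteration touches only local data once the affinity test and fat-pointer accesses are optimized away.

    #include <upc_relaxed.h>

    #define N (4096 * THREADS)            /* illustrative problem size */
    shared double a[N], b[N], c[N];

    void triad(double scalar)
    {
        int i;
        /* Affinity is the unmodified induction variable, so the compiler  */
        /* can apply the upc_forall optimization from slide 12 and the     */
        /* privatization from slide 14.                                    */
        upc_forall (i = 0; i < N; i++; i)
            a[i] = b[i] + scalar * c[i];
    }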

Slide 20: 4.4 NAS CG
- GW's translation of the MPI code into UPC.
- Uses upc_memcpy in place of MPI sends and receives.
- It is not clear whether IBM used GW's hand-optimized version.
- IBM mentions that they manually privatized some pointers, which is what is done in GW's optimized version.
(A sketch of the exchange pattern follows.)
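A hedged illustration of the communication pattern described above; this is not the GW code, and the buffer names, sizes, and barrier-based synchronization are assumptions. It shows a two-sided MPI send/receive pair replaced by a bulk copy between shared buffers.

    #include <upc.h>

    #define CHUNK 1024                                /* illustrative message size */
    shared [CHUNK] double sendbuf[CHUNK * THREADS];   /* one block per thread      */
    shared [CHUNK] double recvbuf[CHUNK * THREADS];

    void exchange_with(int partner)
    {
        /* Instead of MPI_Send on one side and MPI_Recv on the other, copy  */
        /* my block of sendbuf directly into the partner's block of recvbuf. */
        upc_memcpy(&recvbuf[partner * CHUNK], &sendbuf[MYTHREAD * CHUNK],
                   CHUNK * sizeof(double));

        /* A barrier stands in for the synchronization implied by the       */
        /* matching MPI receive.                                             */
        upc_barrier;
    }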

Slide 21: 4.5 Performance Evaluation
- Table 1:
  - "FE" is a MuPC-style front end containing some optimizations.
  - The others all use the TPO front end:
    - no optimizations
    - "indexing": shared-to-local pointer reduction
    - "update": active messages
    - "forall": the upc_forall affinity optimization
  - Speedups are relative to no TPO optimization.
  - The maximum speedup for random and stream is 2.11.

Slide 22: Combined Speedup
- The combined stream speedup is 241!
- This is attributed to the shared-to-local pointer reductions.
- This seems inconsistent with "indexing" speedups of 1.01 and 1.32 for the random and streams benchmarks, respectively.

Slide 23: Table 2: Random Access
- This is basically a measurement of how many asynchronous messages can be started up.
- It is not known whether the network can do combining.
- Beats MPI (0.56 vs. 0.45) on 2048 processors.

Slide 24: Table 3: Streams
- This is EP.

Slide 25: CG
- Speedup tracks MPI through 512 processors.
- Speedup exceeds MPI on 1024 and 2048 processors.
- This is a fixed-problem-size benchmark, so network latency eventually dominates.
- The improvement over MPI is explained: "In the UPC implementation, due to the use of one-sided communication, the overheads are smaller" compared to MPI's two-sided overhead. [But the BlueGene/L network does not implement one-sided communication.]

Slide 26: Comments? I have some...
- Recall slide 12: "... even this simple [upc_forall affinity] optimization captures most of the loops in the existing UPC benchmarks."
- From the abstract: "We demonstrate not only that shared memory programming for hundreds of thousands of processors is possible, but also that with the right support from the compiler and run-time system, the performance of the resulting codes is comparable to MPI implementations."
- The large-scale scalability demonstrated is for two 100-line codes for the simplest of all benchmarks.
- The scalability of CG was knowingly limited by fixed problem size. Only two data points are offered that outperform MPI.

Slide 27: Seminar schedule
Week  Date  Topic
2     1/18  Organizational meeting
3     1/25  UPC Tutorial (SC05), Phil Merkey
4     2/1   MuPC Runtime System Design (PDP 2006), Steve Seidel
5     2/8   Reuse Distance in UPC (PACT05), Steve Carr
6     2/15  UPC Memory Model (IPDPS 2004), Øystein Thorsen
7     2/22  Planning meeting
8     3/1   Communication Optimizations in the Berkeley UPC Compiler (PACT05), Weiming Zhao
9     3/15  UPC on the BlueGene/L (IBM Research Report RC23853), Steve Seidel
10    3/22  Reuse Distances in UPC Applications, Phil Merkey
11    3/29
12    4/6
13    4/13  A UPC Performance Model (IPDPS 2006), Steve Seidel
14    4/20

