
Slide 1: Implementing a Global Address Space Language on the Cray X1: the Berkeley UPC Experience
Christian Bell and Wei Chen
CS252 Class Project, December 10, 2003
Unified Parallel C at LBNL/UCB

Slide 2: Outline
- An Overview of UPC and the Berkeley UPC Compiler
- Overview of the Cray X1
- Implementing the GASNet layer on the X1
- Implementing the runtime layer on the X1
- Serial performance
- Evaluation of compiler optimizations

Slide 3: Unified Parallel C (UPC)
UPC is an explicitly parallel global address space language with SPMD parallelism:
- An extension of ISO C
- User-level shared memory, partitioned by threads
- One-sided (bulk and fine-grained) communication through reads/writes of shared variables
[Figure: the global address space, with shared locations X[0], X[1], ..., X[P] above each thread's private memory]
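A minimal UPC sketch (not from the slides) of the model just described: every thread writes its own element of a shared array, all threads synchronize at a barrier, and thread 0 then reads the other threads' elements with ordinary one-sided accesses.

    #include <upc.h>
    #include <stdio.h>

    shared int x[THREADS];              /* one element per thread, cyclic layout */

    int main(void) {
        int i;
        x[MYTHREAD] = MYTHREAD;         /* each thread writes the element it owns */
        upc_barrier;                    /* global synchronization */
        if (MYTHREAD == 0)
            for (i = 0; i < THREADS; i++)
                printf("x[%d] = %d\n", i, x[i]);   /* one-sided (possibly remote) reads */
        return 0;
    }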

Slide 4: Shared Arrays and Pointers in UPC
- Cyclic: shared int A[n];
- Block cyclic: shared [2] int B[n];
- Indefinite: shared [0] int *C = (shared [0] int *) upc_alloc(n);
Use a pointer-to-shared to access shared data:
- The block size is part of the pointer type
- A generic pointer-to-shared contains: address, thread id, phase
- Cyclic and indefinite pointers are phaseless
[Figure: data layout across two threads. T0 holds A[0], A[2], A[4], ...; B[0], B[1], B[4], B[5], ...; and all of C[0], C[1], C[2], .... T1 holds A[1], A[3], A[5], ... and B[2], B[3], B[6], B[7], ...]
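A short illustration (not from the slides) of these layouts: the standard UPC introspection functions report which thread owns each element and its phase within the block. It assumes a fixed thread count at compile time, so the fixed-size array declarations are legal.

    #include <upc.h>
    #include <stdio.h>

    #define N 8
    shared int A[N];          /* cyclic: A[i] has affinity to thread i % THREADS */
    shared [2] int B[N];      /* block-cyclic with block size 2 */

    int main(void) {
        int i;
        if (MYTHREAD == 0)
            for (i = 0; i < N; i++)
                printf("A[%d]: thread %d   B[%d]: thread %d, phase %d\n",
                       i, (int) upc_threadof(&A[i]),
                       i, (int) upc_threadof(&B[i]), (int) upc_phaseof(&B[i]));
        return 0;
    }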

Slide 5: Accessing Shared Memory in UPC
[Figure: shared memory is partitioned among threads 0 through N-1. A generic pointer-to-shared is the triple (address, thread, phase): the thread field selects the owning thread's partition, the address locates the start of the current block within the array object there, and the phase is the element offset within that block, bounded by the block size. The example pointer shown has thread 0 and phase 2.]
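To make the triple concrete, here is an illustrative C sketch (not the Berkeley UPC runtime's actual representation) of a generic pointer-to-shared and of the wrap-around arithmetic a one-element increment implies; it assumes symmetric heaps, so the same local offset is valid on every thread.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uintptr_t addr;    /* local virtual address within the owning thread's partition */
        uint32_t  thread;  /* owning thread id */
        uint32_t  phase;   /* element offset within the current block */
    } pshared_sketch_t;    /* hypothetical name, for illustration only */

    /* Advance the pointer by one element of size elemsz in an array with
       block size blocksz (in elements) distributed over nthreads threads. */
    void incr_by_one(pshared_sketch_t *p, size_t elemsz,
                     size_t blocksz, size_t nthreads) {
        p->phase++;
        p->addr += elemsz;
        if (p->phase == blocksz) {            /* crossed a block boundary */
            p->phase = 0;
            p->thread++;
            p->addr -= blocksz * elemsz;      /* same block row on the next thread */
            if (p->thread == nthreads) {      /* wrapped past the last thread */
                p->thread = 0;
                p->addr += blocksz * elemsz;  /* move on to the next block row */
            }
        }
    }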

Slide 6: UPC Programming Model Features
- Block-cyclically distributed arrays
- Shared and private pointers
- Global synchronization: barriers
- Pair-wise synchronization: locks
- Parallel loops
- Dynamic shared memory allocation
- Bulk shared memory accesses
- Strict vs. relaxed memory consistency models
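A compact sketch (not from the slides) combining several of these features: a upc_forall loop whose affinity expression keeps each iteration on the owning thread, a upc_lock protecting a shared accumulator, and a barrier. It assumes a fixed thread count at compile time.

    #include <upc.h>
    #include <stdio.h>

    #define N 1024
    shared [4] double v[N];            /* block-cyclically distributed, block size 4 */
    shared double sum;                 /* resides on thread 0, zero-initialized */
    upc_lock_t *lock;                  /* private pointer to a shared lock */

    int main(void) {
        int i;
        double local = 0.0;
        lock = upc_all_lock_alloc();   /* collective: every thread gets the same lock */

        upc_forall (i = 0; i < N; i++; &v[i]) {   /* iteration i runs on the owner of v[i] */
            v[i] = 1.0;
            local += v[i];
        }

        upc_lock(lock);                /* pair-wise synchronization */
        sum += local;
        upc_unlock(lock);
        upc_barrier;                   /* global synchronization */

        if (MYTHREAD == 0) printf("sum = %g\n", sum);
        return 0;
    }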

Slide 7: Overview of the Berkeley UPC Compiler
Two goals: portability and high performance.
[Figure: compilation stack, from top to bottom: UPC code, translator, translator-generated C code, Berkeley UPC runtime system, GASNet communication system, network hardware; the layers are labeled platform-independent, network-independent, compiler-independent, and language-independent]
- Translator: lowers UPC code into ANSI C code
- Runtime: shared memory management and pointer operations
- GASNet: a uniform get/put interface for the underlying networks

Slide 8: A Layered Design
Portable:
- C is our intermediate language
- Can run on top of MPI (with a performance penalty)
- GASNet has a layered design with a small core
High-performance:
- The native C compiler optimizes serial code
- The translator can perform high-level communication optimizations
- GASNet can access network hardware directly and provides a rich set of communication/synchronization primitives

Slide 9: Outline
- An Overview of UPC and the Berkeley UPC Compiler
- Overview of the Cray X1
- Implementing the GASNet layer on the X1
- Implementing the runtime layer on the X1
- Serial performance
- Evaluation of compiler optimizations

Slide 10: The Cray X1 Architecture
- A new line of vector architecture
- Two modes of operation: SSP (up to 16 CPUs per node) and MSP (multistreams long loops)
- Single-node UMA, multi-node NUMA (no caching of remote data)
- Global pointers: low latency, high bandwidth
- Vectorization is crucial: the vector pipeline is 2x faster than scalar and is needed to utilize the memory bandwidth (strided accesses, scatter-gather, reductions, etc.)
- All gets/puts must be loads/stores (directly or through the shmem interface)
- Only puts are "non-blocking"; gets are blocking

Slide 11: Outline
- An Overview of UPC and the Berkeley UPC Compiler
- Overview of the Cray X1
- Implementing the GASNet layer on the X1
- Implementing the runtime layer on the X1
- Serial performance
- Evaluation of compiler optimizations

Slide 12: GASNet Communication System Architecture
A 2-level architecture to ease implementation:
- Core API
  - Based heavily on Active Messages
  - Compatibility layer
  - Ported to the X1 in 2 days, with a new algorithm to manipulate queues in shared memory
- Extended API
  - A wider interface that includes more complicated operations (puts, gets)
  - A reference implementation of the extended API in terms of the core API is provided
  - The current revision is tuned especially for the X1, with shared memory as the primary focus (minimal overhead)
[Figure: software stack: compiler-generated code, compiler-specific runtime system, GASNet extended API, GASNet core API, network hardware]

Slide 13: GASNet Extended API - Remote Memory Operations
GASNet offers expressive put/get primitives:
- All gets/puts can be blocking or non-blocking
- Non-blocking can be explicit (handle-based)
- Non-blocking can be implicit (global or region-based)
- Synchronization can poll or block
- This paves the way for complex split-phase communication (compiler optimizations)
The Cray X1 uses shared memory exclusively:
- All gets/puts must be loads/stores
- Only puts are "non-blocking"; gets are blocking
- Very limited synchronization mechanisms
- Efficient communication only through vectors (one order of magnitude between scalar and vector communication)
- Vectorization instead of split-phase?
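The following C sketch shows the split-phase style this interface enables, using explicit-handle and implicit-handle calls from the GASNet extended API. It is an illustration rather than translator output: it assumes GASNet has already been initialized and that remote points into node 1's registered segment.

    #include <gasnet.h>

    void update_remote(double *local, double *remote, size_t n) {
        /* Initiate a non-blocking get; the handle names the outstanding operation. */
        gasnet_handle_t h = gasnet_get_nb(local, 1, remote, n * sizeof(double));

        /* ... independent computation could be scheduled here ... */

        gasnet_wait_syncnb(h);          /* complete the get before touching local */
        local[0] += 1.0;

        /* Implicit-handle put: completion is tracked globally rather than per operation. */
        gasnet_put_nbi(1, remote, local, n * sizeof(double));
        gasnet_wait_syncnbi_puts();     /* wait for all outstanding implicit puts */
    }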

Slide 14: GASNet and Cray X1 Remote Memory Operations

GASNet operation                         | Cray X1 instruction | Comment
Bulk operations                          | Vector bcopy()      | Fully vectorized, suitable for GASNet/UPC
Non-bulk blocking puts                   | Store + gsync       | No vectorization
Non-bulk blocking gets                   | Load                |
Non-bulk non-blocking explicit puts/gets | Store/load + gsync  | No vectorization if the sync is done in the loop
Non-bulk non-blocking implicit puts/gets | Store/load + gsync  | No vectorization if the sync is done in the loop

- Flexible communication provides no benefit without vectorization (a factor of 10 between vector and scalar)
- It is difficult to expose vectorization through a layered software stack: the native C compiler now has to optimize parallel code!
- The Cray X1's "big hammer" gsync() prevents interesting communication optimizations
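The "sync in the loop" rows can be illustrated with a small sketch (same assumptions as before: GASNet initialized, remote in node 1's segment). Placing the completion call inside the loop forces a gsync per iteration and defeats vectorization; hoisting it out leaves a stream of stores the Cray compiler can vectorize.

    #include <gasnet.h>

    void scatter_per_element_sync(double *src, double *remote, size_t n) {
        size_t i;
        for (i = 0; i < n; i++) {
            gasnet_put_nbi(1, &remote[i], &src[i], sizeof(double));
            gasnet_wait_syncnbi_puts();   /* gsync every iteration: the loop cannot vectorize */
        }
    }

    void scatter_single_sync(double *src, double *remote, size_t n) {
        size_t i;
        for (i = 0; i < n; i++)
            gasnet_put_nbi(1, &remote[i], &src[i], sizeof(double));
        gasnet_wait_syncnbi_puts();       /* one gsync after the loop: the stores can vectorize */
    }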

Slide 15: GASNet/X1 Performance
- GASNet/X1 improves small-message performance
- Minimal overhead as a "portable network assembly language"
- The core API (Active Messages) solves the Cray problem of upc_global_alloc (non-collective memory allocation)
- Synthetic benchmarks show no GASNet interference, but that is not necessarily the case for application benchmarks
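For reference, upc_global_alloc is the non-collective allocator mentioned above: a single thread allocates a shared, block-cyclically distributed region spanning all threads. A hedged usage sketch (names invented):

    #include <upc.h>

    /* Pointer-to-shared, itself stored in shared space on thread 0 so all
       threads can pick it up after the barrier. Block size: 16 doubles. */
    shared [16] double * shared grid;

    void setup_grid(size_t nblocks) {
        if (MYTHREAD == 0)
            grid = (shared [16] double *)
                   upc_global_alloc(nblocks, 16 * sizeof(double));  /* called by one thread only */
        upc_barrier;   /* after the barrier every thread can use grid */
    }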

Slide 16: Outline
- An Overview of UPC and the Berkeley UPC Compiler
- Overview of the Cray X1
- Implementing the GASNet layer on the X1
- Implementing the runtime layer on the X1
- Serial performance
- Evaluation of compiler optimizations

Slide 17: Shared Pointer Representations
- The Cray X1 "memory centrifuge" is useful for UPC
- It is possible to manipulate UPC phaseless pointers directly as X1 global pointers allocated from the symmetric heap
- Heavy function inlining and macros remove all traces of UPC runtime and GASNet calls
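An illustrative sketch (not the actual runtime code) of why this works: with symmetric heaps, a phaseless pointer-to-shared can be packed into a single word whose high bits name the owning node, which is what the X1's address "centrifuge" expects, so a dereference becomes an ordinary global load. The bit position below is an assumption for illustration only.

    #include <stdint.h>

    #define NODE_SHIFT 48                 /* assumed position of the node bits */

    typedef uintptr_t phaseless_ptr;      /* node id | local offset, in one word */

    static inline phaseless_ptr make_phaseless(unsigned node, uintptr_t local_off) {
        return ((phaseless_ptr) node << NODE_SHIFT) | local_off;
    }

    static inline double load_remote(phaseless_ptr p) {
        return *(volatile double *) p;    /* the hardware resolves the node bits */
    }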

Slide 18: Cost of Shared Pointer Arithmetic and Accesses

Slide 19: Outline
- An Overview of UPC and the Berkeley UPC Compiler
- Overview of the Cray X1
- Implementing the GASNet layer on the X1
- Implementing the runtime layer on the X1
- Serial performance
- Evaluation of compiler optimizations

Slide 20: Serial Performance
- It is all about vectorization: Cray C is highly sensitive to changes in the inner loop
- We want the translator's output to be as vectorizable as the original C source
- Strategy: keep the translated code syntactically close to the source
  - Preserve high-level loops
  - a[exp] becomes *(a + exp)
  - Multidimensional arrays are linearized
  - Preserve the restrict qualifier and ivdep pragmas
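As a concrete illustration of this strategy (an invented function, not actual translator output), the generated C keeps the loop shape, lowers subscripts to pointer arithmetic, and carries the restrict qualifiers and the ivdep pragma through so Cray C can still vectorize the inner loop (the exact pragma spelling is compiler-specific).

    /* Hypothetical translator output for a simple private-array loop. */
    void scale(double * restrict a, double * restrict b, int n, double s) {
        int i;
    #pragma ivdep                        /* preserved from the source */
        for (i = 0; i < n; i++)
            *(b + i) = s * *(a + i);     /* b[i] = s * a[i], with subscripts lowered */
    }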

Slide 21: Livermore Loop Kernels

Slide 22: Evaluating Communication Optimizations on the Cray X1
Message aggregation:
- LogGP model: fewer messages means less overhead
- Techniques: message vectorization, coalescing, bulk prefetching
Is this still true for the Cray X1?
- Remote access latency is comparable to local accesses
- Vectorization should hide most of the overhead of small messages
- Remote data are not cache coherent, so it may still help to stage them in local buffers
- Essentially, a question of a fine-grained vs. coarse-grained programming model
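A UPC-level sketch of message vectorization/coalescing (array names invented; remote_row is assumed to already point at one thread's data): instead of n fine-grained remote reads, a single bulk upc_memget stages the data into a private buffer and the arithmetic then runs locally.

    #include <upc.h>
    #include <stdlib.h>

    shared [] double *remote_row;   /* indefinite (phaseless) pointer, assumed already set */

    double row_sum(size_t n) {
        size_t i;
        double s = 0.0;
        double *buf = malloc(n * sizeof(double));

        /* Fine-grained alternative: for (i = 0; i < n; i++) s += remote_row[i]; */

        upc_memget(buf, remote_row, n * sizeof(double));  /* one bulk transfer */
        for (i = 0; i < n; i++)
            s += buf[i];                                  /* purely local computation */

        free(buf);
        return s;
    }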

Slide 23: NAS CG: OpenMP Style vs. MPI Style
- The fine-grained (OpenMP-style) version is still slower: the shared-memory programming style leads to more overhead (redundant boundary computation)
- UPC's hybrid programming model can really help

Slide 24: More Optimizations
Overlapping communication and computation:
- Hides communication latencies behind independent computation
- Examples: communication scheduling, message pipelining
- Requires split-phase operations: try to separate the sync() as far as possible from the non-blocking get/put
But the Cray X1 lacks support for non-blocking gets:
- No user- or compiler-level overlapping
- All communication optimizations rely on vectorization (e.g., gups)
- Vectorization is too restrictive in our opinion: it gives up on pointer code and sync(), bulk-synchronous programs, etc.
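The kind of schedule the slide has in mind looks like the following GASNet-level sketch (assumes GASNet is initialized; the buffers and peer node are placeholders): initiate several non-blocking gets, overlap them with independent work, and sync as late as possible. On the X1 this pattern degenerates, because gets are blocking loads.

    #include <gasnet.h>

    #define K 4

    void pipelined_fetch(double *local[K], double *remote[K],
                         size_t n, gasnet_node_t peer) {
        gasnet_handle_t h[K];
        int k;

        for (k = 0; k < K; k++)           /* initiation phase */
            h[k] = gasnet_get_nb(local[k], peer, remote[k], n * sizeof(double));

        /* ... independent computation overlaps the transfers here ... */

        for (k = 0; k < K; k++) {
            gasnet_wait_syncnb(h[k]);     /* completion phase, as late as possible */
            /* ... consume local[k] ... */
        }
    }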

Slide 25: Conclusion
We have an efficient UPC implementation on the Cray X1. Our evaluation of the Cray X1 for GAS languages:
+ Great latency/bandwidth for both local and remote memory operations
+ Remote communication is transparent through global loads and stores
- The lack of split-phase gets means losing optimization opportunities
- Poor user-level support for communication and synchronization of remote operations (no prefetching, no non-binding or per-operation completion mechanisms)
- Heavy reliance on vectorization for performance: great when it happens, not so much otherwise (slow scalar processor)
- The first platform that is more sensitive to the quality of the translated code than to communication/computation scheduling
± The first possible mismatch between GASNet's semantics and a platform; we're hoping the X2 can address our concerns

