
1 Software Caching for UPC
Wei Chen, Jason Duell, Jimmy Su
Spring 2003

2 Global Address Space
GAS languages share the global address space abstraction:
– Shared memory is partitioned among processors.
– Remote memory may stay remote: no automatic caching is implied.
– One-sided communication through reads and writes of shared variables.
– Both individual and bulk memory copies.
[Diagram: the global address space, split into a shared region holding X[0], X[1], ..., X[P] partitioned across threads, and a per-thread private region whose pointers (ptr) can refer into the shared space.]

3 Unified Parallel C (UPC)
UPC is a parallel extension of C for scientific computing:
– Distributed arrays, shared pointers, parallel loops, and a strict/relaxed memory model.
– Global address space abstraction.
– SPMD parallelism: a fixed number of threads, each keeping its own private data.
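To make these features concrete, a minimal UPC kernel might look like the following; the array size and the reduction on thread 0 are arbitrary choices for illustration:

    #include <upc.h>

    #define N 1024

    shared double A[N];            /* distributed array, cyclic layout */

    int main(void) {
        int i;
        double sum = 0.0;
        /* parallel loop: each thread executes the iterations whose
           element has affinity to it */
        upc_forall (i = 0; i < N; i++; &A[i])
            A[i] = (double)i;
        upc_barrier;               /* synchronize before reading remote data */
        if (MYTHREAD == 0)
            for (i = 0; i < N; i++)
                sum += A[i];       /* fine-grained reads, mostly remote */
        return 0;
    }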

4 Overview of the Berkeley UPC Compiler
Two goals: portability and high performance. The translator is based on Open64.
[Diagram: compilation pipeline — UPC code → translator → translator-generated C code → Berkeley UPC runtime system → GASNet communication system → network hardware. The layers are annotated as platform-independent, network-independent, compiler-independent, and language-independent.]

5 Motivation – Network Latencies

    UPC access:     Private    Local shared    Remote
    No. of cycles:  1-2        ~10             > 2K

6 Motivation – Small v. Large Messages

7 The Need for Software Caching
– GAS languages encourage fine-grained programs with shared-pointer-based communication, but network latencies make such accesses expensive.
– Want to bridge the performance gap between fine- and coarse-grained applications.
– Caching helps by eliminating redundant accesses and prefetching adjacent data.
– Compiler analysis: difficult and imprecise.
– Hardware/DSM approaches: problems with portability, overhead, and granularity.
– Our approach: a software-controlled cache inside the Berkeley UPC runtime.

8 Challenges: The UPC Memory Consistency Model
UPC has both strict and relaxed accesses:
– Strict: like sequential consistency.
– Relaxed: behaves as if the variable were accessed by a single thread.
Two rules:
1. All shared accesses satisfy local data dependencies.
2. The program order of strict accesses must be maintained among all threads.
The compiler can aggressively optimize relaxed accesses, and programmers can count on the program order of strict accesses.
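In UPC source, the distinction is expressed with the strict and relaxed type qualifiers. A minimal sketch of the classic flag-publication idiom, which relies on rule 2 (the variable names are hypothetical):

    #include <upc_relaxed.h>      /* shared accesses default to relaxed */

    shared int data;              /* relaxed access by default */
    strict shared int flag;       /* strict: ordered among all threads */

    int main(void) {
        if (MYTHREAD == 0) {
            data = 42;            /* relaxed write */
            flag = 1;             /* strict write: cannot appear to other
                                     threads before the write to data */
        } else if (MYTHREAD == 1) {
            while (flag == 0)     /* strict read: spin until published */
                ;
            /* having observed flag == 1, this thread is guaranteed
               to read data == 42 */
        }
        return 0;
    }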

9 Problem with the Current Model
Initially X = Y = 0. Thread 0 executes: Write X = 1, then the strict Write Y = 1. Thread 1 executes: Read X, Read Y, Read X.

A possible execution order with caching (time flows downward):

    T1: Read X        cache miss, returns 0; block is cached
    T0: Write X = 1
    T0: Write Y = 1   (strict)
    T1: Read Y        cache miss, returns 1
    T1: Read X        cache hit, returns the stale 0

The results of T1's reads are (0, 1, 0), which violates the UPC memory model: T1 must observe the write to X before the strict write to Y.

10 A New UPC Memory Model
– The current model is too restrictive for compiler optimizations that reorder memory accesses.
– The problem is that relaxed accesses can create data races with other threads' strict accesses.
– A new definition, based on weak ordering: a UPC implementation must appear sequentially consistent to any program that has no data races involving relaxed accesses.
[Diagram: relaxed accesses grouped between sync points.]

11 Caching and the New Model
The model forbids data races on relaxed accesses:
– Races are still allowed on strict accesses, so the expressiveness of the language is not limited.
It also permits an efficient caching implementation:
– Cache remote relaxed accesses.
– Flush the cache and sync pending writes at synchronization points (barriers, locks, fences, strict accesses).
– A "local knowledge" scheme, so no coherence messages are needed.

12 Cache Organization
[Diagram: cache organization, with the following components (a C sketch follows):]
– Cache blocks
– Block descriptors
– Hashtable mapping shared address → block descriptor
– Pending reads
– Pending writes (one list per UPC target thread)
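A plain-C sketch of what these structures might look like inside the runtime; every name and size here is hypothetical, chosen only to make the organization concrete:

    #include <stddef.h>
    #include <stdint.h>

    #define CACHE_BLOCK_SIZE 512     /* hypothetical block size */
    #define HASH_BUCKETS     1024
    #define MAX_THREADS      64

    typedef struct pending_read {
        struct pending_read *next;
        void *dest;                  /* where to deliver the data locally */
    } pending_read_t;

    typedef struct pending_write {
        struct pending_write *next;
        uintptr_t shared_addr;
        size_t    len;
    } pending_write_t;

    typedef struct block_desc {
        uintptr_t shared_addr;       /* base shared address of the block */
        int       target_thread;     /* UPC thread owning the data */
        int       valid;             /* status bit, cleared on flush */
        pending_read_t *readers;     /* reads chained while block in flight */
        char      data[CACHE_BLOCK_SIZE];
    } block_desc_t;

    /* hashtable: shared address -> block descriptor */
    static block_desc_t *hashtable[HASH_BUCKETS];

    /* one pending-write list per UPC target thread */
    static pending_write_t *pending_writes[MAX_THREADS];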

13 Cache Operations
Writes:
– find and sync any conflicting pending block reads
– initiate the network message for the write
– write to the cache block (if resident)
– store the write request in the list of pending writes
Read miss:
– allocate a cache block (evicting if necessary)
– find and sync any conflicting pending writes
– initiate the remote cache block read
– chain the pending request off the block descriptor
Cache flush, at any strict read/write, barrier, etc.:
– just bzero the hashtable and the block status bits (see the sketch below)
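Because invalidation is simply "forget everything", the flush really can be that cheap. A hypothetical C sketch, reusing the shape of the structures above (the arrays and sync_all_pending_writes are stand-ins):

    #include <string.h>

    #define HASH_BUCKETS 1024
    #define NUM_BLOCKS   256

    extern void *hashtable[HASH_BUCKETS];         /* addr -> block descriptor */
    extern unsigned char block_valid[NUM_BLOCKS]; /* per-block status bits */
    extern void sync_all_pending_writes(void);    /* complete delayed writes */

    /* Flush at a synchronization point (strict access, barrier, lock,
       fence). No coherence traffic is needed: clearing the map and the
       status bits forces every later access to miss and refetch. */
    void cache_flush(void)
    {
        sync_all_pending_writes();   /* pending writes must complete first */
        memset(hashtable, 0, sizeof hashtable);
        memset(block_valid, 0, sizeof block_valid);
    }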

14 Cache Operations (continued)
Bulk and unaligned accesses:
– Unlike a hardware cache, accesses can have arbitrary length and start address.
– The fast path assumes an access fits into a single cache block.
– Bulk operations usually imply hand-optimization: they bypass the cache.
Bulk writes (same as small writes, but they can affect multiple blocks):
– find and sync any conflicting pending block reads
– initiate the network message for the write
– write to the cache blocks (if resident)
– store the write request in the list of pending writes
Bulk reads:
– can completely bypass the cache mechanism; the new memory model allows older values to remain in the cache
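For reference, UPC expresses bulk transfers through library calls such as upc_memget, which is one reason the runtime can treat them as hand-optimized and route them around the cache. A minimal example (the array is given affinity to thread 0 only to keep the single-source copy legal):

    #include <upc.h>

    shared [1024] double A[1024];  /* whole array lives on thread 0 */

    int main(void) {
        double local[1024];
        /* bulk read: one large message instead of 1024 fine-grained
           reads; such a call can bypass the software cache entirely */
        upc_memget(local, A, sizeof local);
        return 0;
    }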

15 Write-Packing Buffer
[Diagram: buffer layout — the addr/len entries (Addr1/Len1, Addr2/Len2) are stored separately from the packed payload (Data1, Data2).]
– Delay writes and place them into a per-node write-packing buffer, storing the addr/len info separately from the data.
– Pack later messages around earlier ones, resulting in fewer, larger messages.

16 Write-Packing Buffer (continued)
[Diagram: Addr1, Data1 + Data2, Len1+2 — two contiguous writes merged into a single entry.]
– Contiguous writes can be coalesced into a single message: just change the length field.
– Strict writes can be packed into the buffer, as long as the remote side unpacks the writes in the correct order.
– The min/max address can be kept to speed up conflict checks.
– Write-after-write conflicts are handled trivially; this will be useful if conflict detection moves from compile time to run time.
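A simplified C sketch of the coalescing rule; the structure names, the fixed sizes, and the omission of overflow/flush handling are all simplifications for illustration:

    #include <stdint.h>
    #include <string.h>

    #define WPB_MAX_ENTRIES 128

    typedef struct {
        uintptr_t addr;            /* remote start address */
        size_t    len;             /* length in bytes */
        size_t    data_off;        /* payload offset in the data area */
    } wpb_entry_t;

    typedef struct {
        wpb_entry_t entries[WPB_MAX_ENTRIES]; /* addr/len metadata ... */
        char        data[8192];               /* ... kept apart from payload */
        int         nentries;
        size_t      ndata;
        uintptr_t   min_addr, max_addr;       /* for fast conflict checks */
    } wpb_t;

    void wpb_init(wpb_t *b)
    {
        b->nentries = 0;
        b->ndata = 0;
        b->min_addr = UINTPTR_MAX;
        b->max_addr = 0;
    }

    /* Append a delayed write; coalesce with the previous entry when the
       two writes are contiguous in remote memory. Payload bytes are
       always appended in order, so the last entry's data ends at ndata. */
    void wpb_put(wpb_t *b, uintptr_t addr, const void *src, size_t len)
    {
        wpb_entry_t *last = b->nentries ? &b->entries[b->nentries - 1] : 0;
        if (last && last->addr + last->len == addr) {
            last->len += len;      /* contiguous: just grow the length */
        } else {
            wpb_entry_t *e = &b->entries[b->nentries++];
            e->addr = addr;
            e->len = len;
            e->data_off = b->ndata;
        }
        memcpy(b->data + b->ndata, src, len);
        b->ndata += len;
        if (addr < b->min_addr) b->min_addr = addr;
        if (addr + len > b->max_addr) b->max_addr = addr + len;
    }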

17 Preliminary Results

    Access type:  Local    Remote    Cache hit
    Time (ns):    20       27000     350

18 Array Prefetching
– A fine-grained program is unlikely to match the performance of an equivalent coarse-grained program through runtime caching alone.
– The goal is to further bridge the gap between fine-grained and coarse-grained programs with prefetching.
– We implemented prefetching for regular array accesses.

19 Example
A is a remote array.

Fine-grained version:

    sum = 0;
    foreach (p in A.domain()) {
        sum += A[p];
    }

Coarse-grained version:

    sum = 0;
    double [1d] local copyA;
    copyA = new double[A.domain()];
    copyA.copy(A);
    foreach (p in copyA.domain()) {
        sum += copyA[p];
    }

20 Titanium Memory Model
1. Locally sequentially consistent: for a single processor, all reads and writes to a given memory location must appear to occur in exactly the order specified.
2. Globally consistent at synchronization events: at a global synchronization event, such as a barrier, all processors must agree on the values of all variables. At a non-global synchronization event, such as entry into a critical section, the processor must see all previous updates made using that synchronization event.

21 Implementation
Identify array accesses inside foreach loops as candidates for prefetching, where:
– the foreach loop is a full-domain loop
– the array access appears on every iteration of the loop
– the addresses touched by the array access can be computed from the iteration domain and loop-invariant pointer increments
Insert code in the loop setup to prefetch the elements of the array that will be used during the foreach loop, and change the array references inside the loop to local ones (see the sketch below).
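In spirit, the transformation turns per-iteration remote reads into one bulk transfer in the loop setup. A C-style sketch of the rewritten loop; prefetch_range and the integer-indexed interface are hypothetical stand-ins for the runtime's actual prefetch machinery:

    #include <stdlib.h>

    /* hypothetical runtime call: fetch n doubles starting at remote
       index lo of the array identified by id into private buffer dst */
    extern void prefetch_range(int id, int lo, int n, double *dst);

    double sum_prefetched(int array_id, int lo, int hi)
    {
        int i, n = hi - lo;
        double *buf = malloc(n * sizeof *buf);
        double sum = 0.0;

        prefetch_range(array_id, lo, n, buf);  /* inserted in loop setup */
        for (i = 0; i < n; i++)
            sum += buf[i];                     /* reference made local */

        free(buf);
        return sum;
    }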

22 Implementation (continued)
– Make sure that synchronization operations are not called during the execution of the foreach loop.
– Flush the prefetched data at synchronization points.
– Resolve conflicts caused by remote writes and array copies by merging the changes into the prefetched data.

23 Benefits
– Message coalescing.
– Local pointer accesses instead of global pointer accesses:
  – A global pointer access handles the general case, where the data can be remote or local; the location of the data is checked at run time.
  – A local pointer access translates into a simple pointer dereference in C.
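The same benefit exists in UPC: once data is known to be local, a shared pointer can be cast to a private one, and each dereference becomes a plain C load. A small sketch:

    #include <upc.h>

    shared [100] double A[100*THREADS];  /* 100 elements per thread */

    int main(void) {
        /* global pointer: every dereference pays a run-time
           locality check and possibly communication */
        shared [100] double *gp = &A[MYTHREAD * 100];

        /* private pointer: legal because this block has affinity
           to MYTHREAD; dereferences compile to plain loads */
        double *lp = (double *)gp;

        double sum = 0.0;
        for (int i = 0; i < 100; i++)
            sum += lp[i];
        return 0;
    }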

24 Configuration
– Seaborg (IBM SP), with an SP Switch2 interconnect
– CPU: 375 MHz POWER3+
– 2 nodes with 4 processors per node

25 Benchmarks
Sharks and Fish particle simulation:
– At every time step, the forces on each particle from all other particles are summed.
– Particles are distributed evenly among processors.
– Problem size: 1000 particles.
Dense matrix-vector multiply:
– Problem size: 1024x1024 matrix of doubles.
– Matrix layout: 1D by rows.

26 Performance

    Benchmark                 Fine-grained       Prefetch             Coarse-grained
    Sharks and Fish           376 seconds (1x)   221 seconds (1.7x)   223 seconds (1.7x)
    Matrix Vector Multiply    27354 ms (1x)      10 ms (2735x)        10 ms (2735x)

Configuration: 8 processors on 2 nodes of Seaborg; processor speed: 375 MHz POWER3+.

27 Reasons for Speedup

    Time saved by:            Message Coalescing   Local Pointer Access   Total
    Sharks and Fish           152 seconds          2 seconds              154 seconds
    Matrix Vector Multiply    27328 ms             16 ms                  27344 ms

Configuration: 8 processors on 2 nodes of Seaborg; processor speed: 375 MHz POWER3+.

