Software Caching for UPC
Wei Chen, Jason Duell, Jimmy Su
Spring 2003

Global Address Space
The languages share the global address space abstraction:
– Shared memory is partitioned by processors
– Remote memory may stay remote: no automatic caching implied
– One-sided communication through reads/writes of shared variables
– Both individual and bulk memory copies
[Figure: global address space with shared elements X[0], X[1], ..., X[P] and a private pointer (ptr) in each thread's private space]

Unified Parallel C (UPC)
UPC is a parallel extension to C for scientific computing
– With distributed arrays, shared pointers, parallel loops, and a strict/relaxed memory model
– Global address space abstraction
SPMD parallelism
– Fixed number of threads
– Each thread keeps its own private data
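As a quick illustration of these features, here is a minimal UPC sketch (not from the original slides; the array size, names, and loop body are invented for illustration):

    #include <upc_relaxed.h>          /* default shared accesses to relaxed */

    #define NPER 256                  /* elements per thread (illustrative) */

    shared double a[NPER*THREADS];    /* distributed array, default cyclic layout */
    shared double *p;                 /* private pointer-to-shared (one copy per thread) */

    int main(void) {
        int i;

        /* SPMD: every one of the fixed number of THREADS threads runs main(),
           and each keeps its own private copies of i, p, etc. */
        upc_forall (i = 0; i < NPER*THREADS; i++; &a[i])
            a[i] = MYTHREAD;          /* iteration i runs on the thread that owns a[i] */

        upc_barrier;

        if (MYTHREAD == 0) {
            double x;
            p = &a[1];                /* points at thread 1's data when THREADS > 1 */
            x = *p;                   /* one-sided, fine-grained (possibly remote) read */
            (void)x;
        }
        return 0;
    }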

Overview of Berkeley UPC Compiler
[Diagram: UPC code → translator → translator-generated C code → Berkeley UPC runtime system → GASNet communication system → network hardware, with the layers designed to be platform-, network-, compiler-, and language-independent]
Two goals: portability and high performance. Open64 based.

Motivation – Network Latencies

UPC accesses    Private    Local shared    Remote
No. cycles      1-2        ~10             > 2K

Motivation – Small v. Large Messages

The Need for Software Caching
– GAS languages have high network latencies, yet encourage fine-grained programs with shared pointer-based communication
– Want to bridge the performance gap between fine- and coarse-grained applications
– Caching helps by eliminating redundant accesses and prefetching adjacent data
– Compiler analysis: difficult and imprecise
– Hardware/DSM caching: portability, overhead, granularity problems
– Our approach: use a software-controlled cache inside the Berkeley UPC runtime

Challenges: UPC Memory Consistency Model
UPC has both strict and relaxed memory models
– Strict: like sequential consistency
– Relaxed: accesses behave as if the variables were used by only a single thread
Two rules:
1. All shared accesses satisfy local data dependencies.
2. Program order of strict accesses must be maintained among all threads.
The compiler can aggressively optimize relaxed accesses, and programmers can count on the program order of strict accesses.
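To make the two access types concrete, here is a hedged sketch of the usual flag-synchronization idiom; the strict and relaxed type qualifiers are standard UPC, but the variables and functions are invented:

    #include <upc_relaxed.h>       /* relaxed is the default for shared accesses */

    relaxed shared int data;       /* relaxed: may be reordered or cached (rule 1 only) */
    strict  shared int flag;       /* strict: rule 2 orders it for all threads */

    void producer(void) {
        data = 42;                 /* relaxed write */
        flag = 1;                  /* strict write: other threads must see the write
                                      to data before they see flag == 1 */
    }

    void consumer(void) {
        int v;
        while (flag == 0)          /* strict read: re-reads memory, cannot be hoisted */
            ;
        v = data;                  /* relaxed read: intended to observe 42 here, since
                                      the strict flag accesses order the data accesses */
        (void)v;
    }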

Problem with the Current Model
(initially X = Y = 0)

Thread 0:  Write X = 1;  Write Y = 1 (strict)
Thread 1:  Read X;  Read Y;  Read X

Possible execution order with caching:
– T1: Read X – cache miss, returns 0
– T0: Write X = 1;  Write Y = 1
– T1: Read Y – cache miss, returns 1
– T1: Read X – cache hit, returns the stale 0

The results of T1's reads are (0, 1, 0), which violates the UPC memory model: T1 must observe the write to X before the strict write to Y.

A New UPC Memory Model
– The current model is too restrictive for compiler optimizations that reorder memory accesses.
– The problem is that relaxed accesses can create data races with other threads' strict accesses.
– A new definition based on weak ordering: a UPC implementation must appear sequentially consistent to any program that does not have data races involving relaxed accesses.
[Figure: blocks of relaxed accesses separated by sync operations]

Caching and the New Model
The model forbids data races on relaxed accesses
– Races are still allowed on strict accesses, so the expressiveness of the language is not limited
It permits an efficient caching implementation
– Cache remote relaxed accesses
– Flush the cache and sync pending writes at synchronization points (barriers, locks, fences, strict accesses)
– A "local knowledge" scheme, so no need for coherence messages

Cache Organization
– Cache blocks
– Block descriptors
– Hashtable (shared address -> block descriptor)
– Pending reads
– Pending writes (per UPC target thread)
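A rough C sketch of the organization named above; the types, field names, and sizes are invented for illustration and are not the actual Berkeley UPC runtime definitions:

    #include <stdint.h>
    #include <stddef.h>

    #define CACHE_BLOCK_SIZE 512          /* illustrative block size */

    /* One pending remote read, chained off a block descriptor. */
    typedef struct pending_read {
        struct pending_read *next;
        void   *handle;                   /* network handle to sync on */
    } pending_read_t;

    /* One delayed write; kept in a list per UPC target thread. */
    typedef struct pending_write {
        struct pending_write *next;
        uintptr_t remote_addr;
        size_t    len;
        void     *handle;
    } pending_write_t;

    /* Descriptor for one cached block of remote shared memory. */
    typedef struct block_desc {
        uintptr_t       remote_addr;      /* base shared address of the block */
        int             target_thread;    /* UPC thread that owns the data */
        unsigned        status;           /* valid / read-pending / etc. bits */
        pending_read_t *pending_reads;    /* reads waiting on this block */
        uint8_t         data[CACHE_BLOCK_SIZE];
    } block_desc_t;

    /* Top-level cache: hashtable from shared address to block descriptor,
       plus per-target-thread lists of pending writes. */
    typedef struct upc_cache {
        block_desc_t   **hashtable;       /* shared address -> block descriptor */
        size_t           table_size;
        pending_write_t **pending_writes; /* indexed by UPC target thread */
    } upc_cache_t;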

Cache Operations
Writes:
– find and sync any conflicting pending block reads
– initiate network message for the write
– write to cache block (if resident)
– store write request in list of pending writes
Read miss:
– allocate cache block (evicting if necessary)
– find and sync any conflicting pending writes
– initiate remote cache block read
– chain pending request off of the descriptor
Cache flush (at any strict read/write, barrier, etc.):
– just bzero the hashtable and block status bits
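Continuing the sketch above, a hypothetical read-miss and flush path using the structures from the previous sketch; the helpers alloc_block, sync_conflicting_writes, and start_remote_get are invented placeholders, not real runtime calls:

    #include <string.h>

    block_desc_t *alloc_block(upc_cache_t *c, uintptr_t addr, int thread); /* may evict */
    void sync_conflicting_writes(upc_cache_t *c, uintptr_t addr, size_t len);
    void *start_remote_get(int thread, uintptr_t addr, void *dst, size_t len);

    /* Read miss: allocate a block, wait for conflicting pending writes,
       start the remote read, and chain the request off the descriptor. */
    void cache_read_miss(upc_cache_t *c, int thread, uintptr_t addr,
                         pending_read_t *req) {
        block_desc_t *b = alloc_block(c, addr, thread);        /* evicts if needed */
        sync_conflicting_writes(c, addr, CACHE_BLOCK_SIZE);
        req->handle = start_remote_get(thread, b->remote_addr,
                                       b->data, CACHE_BLOCK_SIZE);
        req->next = b->pending_reads;                          /* chain the request */
        b->pending_reads = req;
    }

    /* Cache flush at a strict access, barrier, lock, or fence:
       just zero the hashtable (and block status bits). */
    void cache_flush(upc_cache_t *c) {
        memset(c->hashtable, 0, c->table_size * sizeof(block_desc_t *));
    }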

Cache Operations (cont.)
Bulk and unaligned accesses:
– Unlike a hardware cache, accesses can have arbitrary length and start address
– Fast path assumes the access fits into a single cache block
– Bulk operations usually imply hand-optimization: bypass the cache
Bulk writes (same as small writes, but affect multiple blocks):
– find and sync any conflicting pending block reads
– initiate network message for the write
– write to cache block (if resident)
– store write request in list of pending writes
Bulk reads:
– can completely bypass the cache mechanism
– the new memory model allows older values to remain in the cache

Write-packing buffer
[Figure: per-node packing buffer with addr/len records (Addr1/Len1, Addr2/Len2) kept separately from the packed data (Data1, Data2)]
– Delay writes, and place into per-node write-packing buffer
  – store addr/len info separately from the data
– Pack later messages around earlier ones
  – results in fewer, larger messages

Write-packing buffer (cont.)
[Figure: the two records merged into one – Addr1, Data1 + Data2, Len1+2]
– Contiguous writes can be coalesced into a single message: just change the length field
– Strict writes can be packed into the buffer, as long as the remote side unpacks writes in the correct order
– Min/max address can be kept to speed up conflict checks
– Also trivially handles write-after-write conflicts: will be useful if conflict detection is moved from compile time to run time
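A hedged C sketch of such a buffer, with the addr/len records kept separately from the data and contiguous writes coalesced by extending the previous length field; names and sizes are invented:

    #include <string.h>
    #include <stdint.h>
    #include <stddef.h>

    #define PACK_BUF_BYTES 4096
    #define PACK_BUF_SLOTS 64

    /* One buffer per destination node (illustrative layout). */
    typedef struct pack_buffer {
        int       nslots;
        uintptr_t addr[PACK_BUF_SLOTS];   /* remote destination of each record */
        size_t    len [PACK_BUF_SLOTS];
        size_t    used;                   /* bytes of data currently packed */
        uint8_t   data[PACK_BUF_BYTES];
        uintptr_t min_addr, max_addr;     /* speeds up conflict checks */
    } pack_buffer_t;

    /* Append a delayed write, coalescing with the previous record when the
       destination is contiguous (just extend the length field). */
    int pack_write(pack_buffer_t *b, uintptr_t addr, const void *src, size_t len) {
        int coalesce = (b->nslots > 0 &&
                        b->addr[b->nslots-1] + b->len[b->nslots-1] == addr);

        if (b->used + len > PACK_BUF_BYTES ||
            (!coalesce && b->nslots == PACK_BUF_SLOTS))
            return 0;                     /* no room: caller sends the buffer first */

        if (coalesce) {
            b->len[b->nslots-1] += len;   /* contiguous: one larger message */
        } else {
            b->addr[b->nslots] = addr;
            b->len [b->nslots] = len;
            b->nslots++;
        }
        memcpy(b->data + b->used, src, len);
        b->used += len;

        if (b->nslots == 1 || addr < b->min_addr) b->min_addr = addr;
        if (addr + len > b->max_addr)             b->max_addr = addr + len;
        return 1;
    }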

Preliminary Results
[Table: time (ns) for local accesses, remote accesses, and cache hits]

Array Prefetching
– It is unlikely that a fine-grained program can match the performance of an equivalent coarse-grained program with runtime caching alone
– Goal is to further bridge the gap between fine-grained and coarse-grained programs with prefetching
– We implemented prefetching for regular array accesses

Example (Titanium code)
A is a remote array.

Fine-grained version:

    sum = 0;
    foreach (p in A.domain()) {
        sum += A[p];
    }

Coarse-grained version:

    sum = 0;
    double [1d] local copyA;
    copyA = new double[A.domain()];
    copyA.copy(A);
    foreach (p in copyA.domain()) {
        sum += copyA[p];
    }

Titanium Memory Model
1. Locally sequentially consistent: for a single processor, all reads and writes to a given memory location must appear to occur in exactly the order specified.
2. Globally consistent at synchronization events: at a global synchronization event, such as a barrier, all processors must agree on the values of all the variables. At a non-global synchronization event, such as entry into a critical section, the processor must see all previous updates made using that synchronization event.

Implementation
– Identify array accesses inside foreach loops as candidates for prefetching
  – The foreach loop is a full-domain loop
  – The array access appears on every iteration of the loop
  – The addresses touched by the array access can be computed from the iteration domain and loop-invariant pointer increments
– Insert code in the loop setup to prefetch the elements of the array that will be used during the foreach loop
– Change array references inside the loop to local ones

Implementation (cont.)
– Need to make sure that synchronization operations do not get called during the execution of the foreach loop
– Flush the prefetched data at synchronization points
– Resolve conflicts caused by remote writes and array copies by merging the changes into the prefetched data
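The same transformation can be sketched by hand; the version below is written in UPC/C rather than Titanium to stay consistent with the earlier sketches (the blocked array layout, the buffer, and the upc_memget-based prefetch are illustrative, not the actual compiler-generated code, and the flush/merge handling at synchronization points is omitted):

    #include <upc_relaxed.h>
    #include <upc.h>                      /* upc_memget */

    #define N 1024

    shared [N] double A[N*THREADS];       /* blocked layout: N elements per thread */

    double sum_remote_row(int owner) {
        static double buf[N];             /* private per-thread prefetch buffer */
        double sum = 0.0;
        int i;

        /* loop setup: prefetch the whole section touched by the loop */
        upc_memget(buf, &A[owner*N], N * sizeof(double));

        /* loop body: local pointer accesses only */
        for (i = 0; i < N; i++)
            sum += buf[i];
        return sum;
    }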

Benefits
– Message coalescing
– Local pointer accesses vs. global pointer accesses
  – A global pointer access handles the general case, where the data can be remote or local; the location of the data is checked at runtime
  – A local pointer access translates into a simple pointer dereference in C
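A schematic of that difference (illustrative only; the pointer-to-shared representation and the helper below are invented, not the translator's actual generated code):

    #include <stdint.h>

    typedef struct {
        int       thread;     /* owning UPC thread */
        uintptr_t addr;       /* address within that thread's memory */
    } sptr_t;

    extern int    my_thread;                       /* stand-in for MYTHREAD */
    extern double remote_get_double(sptr_t p);     /* invented runtime helper */

    /* Global pointer access: locality is checked at runtime on every dereference. */
    static double deref_global(sptr_t p) {
        if (p.thread == my_thread)
            return *(double *)p.addr;              /* data happens to be local */
        return remote_get_double(p);               /* remote: goes to the network */
    }

    /* Local pointer access: a simple pointer dereference in C. */
    static double deref_local(const double *p) {
        return *p;
    }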

Configuration
– Seaborg (IBM SP), which uses an SP Switch2 switch
– CPU type: 375 MHz Power 3+
– 2 nodes with 4 processors on each node

Benchmarks
Sharks and Fish particle simulation:
– At every time step, the forces on each particle due to all of the other particles are summed
– Particles are distributed evenly among processors
– Problem size: 1000 particles
Dense matrix-vector multiply:
– Problem size: 1024x1024 matrix of doubles
– Matrix layout: 1D by rows

Performance

                          Fine-grained        Prefetch             Coarse-grained
Sharks and Fish           376 seconds (1x)    221 seconds (1.7x)   223 seconds (1.7x)
Matrix Vector Multiply    ms (1x)             10 ms (2735x)        10 ms (2735x)

Configuration: 8 processors on 2 nodes on Seaborg; processor speed: 375 MHz Power 3+

Reasons for Speedup

                          Message Coalescing   Local Pointer Access   Total Speedup
Sharks and Fish           152 seconds          2 seconds              154 seconds
Matrix Vector Multiply    ms                   16 ms                  27344 ms

Configuration: 8 processors on 2 nodes on Seaborg; processor speed: 375 MHz Power 3+