A Parallel Communication Infrastructure for STAPL

Presentation transcript:

A Parallel Communication Infrastructure for STAPL
Steven Saunders, Lawrence Rauchwerger
PARASOL Laboratory, Department of Computer Science, Texas A&M University

Overview
STAPL
- the Standard Template Adaptive Parallel Library
- parallel superset of the C++ Standard Template Library
- provides transparent communication through parallel containers and algorithms
The Parallel Communication Infrastructure
- foundation for communication in STAPL
- maintains high performance and portability
- simplifies parallel programming

Common Communication Models
[figure: common communication models ordered by level of abstraction]

STL Overview
The C++ Standard Template Library
- set of generic containers and algorithms, generically bound by iterators
- containers: data structures with methods
- algorithms: operations over a sequence of data
- iterators: abstract view of data
  - abstracted pointer: dereference, increment, equality
Example:
  std::vector<int> v( 100 );
  // ...initialize v...
  std::sort( v.begin(), v.end() );

STAPL Overview
The Standard Template Adaptive Parallel Library
- set of generic, parallel containers and algorithms, generically bound by pRanges
- pContainers: distributed containers
- pAlgorithms: parallel algorithms
- pRange: abstract view of partitioned data
  - view of distributed data, random access, data dependencies
Example:
  stapl::pvector<int> pv( 100 );
  // ...initialize pv...
  stapl::p_sort( pv.get_prange() );

Fundamental Requirements for the Communication Infrastructure
Communication
- statement: tell a process something
- question: ask a process for something
Synchronization
- mutual exclusion: ensure atomic access
- event ordering: ensure dependencies are satisfied

Design
Goal:
- abstract the underlying communication model to enable efficient support for the requirements
- focus on parallelism for STL-oriented C++ code
Solution:
- message passing makes C++ STL code difficult
- shared memory is not yet implemented on large systems
- remote method invocation (RMI):
  - can support the communication requirements
  - can support high performance through message-passing or shared-memory implementations
  - maps cleanly to object-oriented C++

Design
Communication
- statement: template<class Class, class Rtn, class Arg1...>
             void async_rmi( int destNode, Class* destPtr, Rtn (*method)(Arg1...), Arg1 a1... )
- question:  Rtn sync_rmi( int destNode, Class* destPtr, Rtn (*method)(Arg1...), Arg1 a1... )
- groups:    void broadcast_rmi(), Rtn collect_rmi()
Synchronization
- mutual exclusion: remote methods are atomic
- event ordering: void rmi_fence(), void rmi_wait()
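
For illustration, a caller might use these primitives roughly as in the sketch below. The counter class and the surrounding function are hypothetical, and the calls simply follow the signatures listed above rather than a confirmed STAPL header; a real program would also need whatever object registration the library requires.

  // Hypothetical usage of the primitives above (a sketch, not the actual STAPL API).
  class counter {
    int value;
  public:
    counter() : value( 0 ) {}
    void add( int x ) { value += x; }   // no return value needed: a "statement"
    int  get()        { return value; } // returns an answer: a "question"
  };

  void example( int destNode, counter* destPtr ) {
    // statement: asynchronous, fire-and-forget update of the remote object
    stapl::async_rmi( destNode, destPtr, &counter::add, 10 );

    // question: blocks until the remote method returns its result
    int current = stapl::sync_rmi( destNode, destPtr, &counter::get );

    // event ordering: wait until all outstanding RMIs have completed
    stapl::rmi_fence();
  }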

Design: Data Transfer
Transfer the work to the data
- only one instance of an object exists at once
  - no replication and merging as in DSM
- transfer granularity: method arguments
  - pass-by-value: eliminates sharing
Argument classes must implement a method that defines their type
- internal variables are either local or dynamic
- used to pack/unpack as necessary
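
As a sketch of the packing rule described above, an argument class might declare which members are local (copied in place) and which are dynamic (heap data that must be packed explicitly), along the following lines. The method name define_type, the typer type, and its local/dynamic calls are illustrative assumptions, not a confirmed interface.

  // Hypothetical argument class (the define_type/typer names are assumptions).
  class sample_request {
    int     begin_;     // local: copied directly when the argument is packed
    int     count_;     // local
    double* samples_;   // dynamic: heap data that must be packed/unpacked explicitly
  public:
    void define_type( stapl::typer& t ) {
      t.local( begin_ );
      t.local( count_ );
      t.dynamic( samples_, count_ );  // pointer plus element count
    }
  };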

Integration with STAPL
[layer diagram, top to bottom: User Code; pAlgorithms, pContainers, pRange; Address Translator; Communication Infrastructure; Pthreads, OpenMP, MPI, Native]

Integration: pContainers
Set of distributed sub-containers
- pContainer methods abstract communication
- decision between shared memory and message passing is made in the communication infrastructure
Communication patterns:
- access: access data in another sub-container, handled by sync_rmi
- update: update data in another sub-container, handled by async_rmi
- group update: update all sub-containers, handled by broadcast_rmi
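
The update pattern might be implemented along the lines of the sketch below, in which a pContainer method hides the locality decision from the caller; owner(), local_set(), and thisNode are made-up helpers for this sketch, not part of the interface shown in the talk.

  // Hypothetical pContainer method illustrating the "update" pattern.
  template<class T>
  void pvector<T>::set( const int index, const T& value ) {
    if( owner( index ) == thisNode )
      local_set( index, value );                 // element lives in the local sub-container
    else
      stapl::async_rmi( owner( index ), this,    // fire-and-forget update of the owner
                        &stapl::pvector<T>::local_set, index, value );
  }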

Integration: pAlgorithms
Set of parallel_task objects
- input per parallel_task specified by the pRange
- intermediate results stored in pContainers
- RMI for communication between parallel_tasks
Communication patterns:
- event ordering: tell workers when to start
- data parallel: apply operation in parallel, followed by a parallel reduction, handled by collect_rmi
- bulk communication: large number of small messages, handled by async_rmi
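
The data-parallel pattern could look roughly like the sketch below: each parallel_task reduces its local input, then combines the partial results with collect_rmi. The slide does not show collect_rmi's argument list, so the combine-function parameter used here is an assumption.

  // Hypothetical parallel_task performing a local sum followed by a parallel reduction.
  struct sum_task {
    static int add( int a, int b ) { return a + b; }

    int run( const int* local_input, int n ) {
      int partial = 0;
      for( int i = 0; i < n; ++i )
        partial += local_input[i];                            // apply operation in parallel (local work)
      return stapl::collect_rmi( &sum_task::add, partial );   // parallel reduction across tasks
    }
  };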

Case Study: Sample Sort
Common parallel algorithm for sorting, based on distributing data into buckets
Algorithm:
1. sample a set of splitters from the input
2. send elements to the appropriate bucket based on the splitters (e.g., elements less than splitter 0 are sent to bucket 0)
3. sort each bucket
[figure: input distributed across buckets according to splitters]

Case Study: Sample Sort
  //...all processors execute code in parallel...
  stapl::pvector<int> splitters( p-1 );
  splitters[id] = /*...sample...*/;
  stapl::pvector< vector<int> > buckets( p );
  for( int i = 0; i < size; i++ ) {                //distribute
    int dest = /*...appropriate bucket based on splitters...*/;
    stapl::async_rmi( dest, ..., &stapl::pvector::push_back, input[i] );
  }
  stapl::rmi_fence();
  sort( buckets[id].begin(), buckets[id].end() );  //sort

  template<class T> T& pvector<T>::operator[]( const int index ) {
    if( /*...index is local...*/ )
      return /*...element...*/;
    else
      return stapl::sync_rmi( /*owning node*/, ..., &stapl::pvector<T>::operator[], index );
  }

Implementation Issues
RMI request scheduling
- tradeoff between local computation and incoming RMI requests
- current solution: explicit polling
async_rmi
- automatic buffering to reduce network congestion
rmi_fence
- deadlock: native barriers block while waiting, but the fence must poll while waiting
  - current solution: custom fence implementation
- completion: RMIs can invoke other RMIs...
  - current solution: overlay a distributed termination algorithm
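
To make the scheduling tradeoff concrete, a compute loop might interleave local work with explicit polls, roughly as below; rmi_poll() is an assumed name standing in for whatever primitive performs the explicit poll, and do_local_work() is purely illustrative.

  // Hypothetical compute loop that balances local work against incoming RMIs.
  void process( const int* work, int n ) {
    for( int i = 0; i < n; ++i ) {
      do_local_work( work[i] );   // local computation
      stapl::rmi_poll();          // service pending incoming RMI requests
    }
    stapl::rmi_fence();           // ensure all outstanding RMIs complete before proceeding
  }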

Performance
Two implementations:
- Pthreads (shared memory)
- MPI-1.1 (message passing)
Three benchmark platforms:
- Hewlett Packard V2200: 16 processors, shared-memory SMP
- SGI Origin 3800: 48 processors, distributed shared-memory CC-NUMA
- Linux Cluster: 8 processors, 1 Gb/s Ethernet switch

Latency
[chart: time (µs) to ping-pong a message between two processors using explicit communication or STAPL (async_rmi/sync_rmi)]
- overhead due to the high cost of RMI request creation and scheduling versus the low cost of communication
- overhead due to native MPI optimizations that are not applicable with RMI

Effect of Automatic Aggregation
[charts: effect of automatic message aggregation]

Native Barrier vs. STAPL Fence
[charts: native barrier vs. STAPL fence, message passing and shared memory]
- message passing: overhead due to native optimizations within MPI_Barrier that are unavailable to STAPL
- shared memory: majority of overhead due to the cost of polling and termination detection

Sample Sort for 10M Integers
[charts: sample sort performance for 10M integers]

Inner Product for 40M Integers
[chart: time (s) to compute the inner product of 40M-element vectors using shared memory]

Inner Product for 40M Integers
[chart: time (s) to compute the inner product of 40M-element vectors using message passing]

Conclusion
STAPL
- provides transparent communication through parallel containers and algorithms
The Parallel Communication Infrastructure
- foundation for communication in STAPL
- maintains high performance and portability
- simplifies parallel programming
Future Work
- mixed-mode MPI and OpenMP
- additional implementation issues