TOWARDS ASYNCHRONOUS AND MPI-INTEROPERABLE ACTIVE MESSAGES
Xin Zhao, Darius Buntinas, Judicael Zounmevo, James Dinan, David Goodell, Pavan Balaji, Rajeev Thakur, Ahmad Afsahi, William Gropp
University of Illinois at Urbana-Champaign, Argonne National Laboratory, Queen's University
13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

Motivation
Many new, important large-scale applications (e.g., bioinformatics, social network analysis) are different from "traditional" applications: many small messages are sent to random targets, the communication pattern is irregular, and the computation is data-driven. Approaches designed for "traditional" applications are therefore not well suited.

Breadth-First Search (two-sided MPI)
while any rank's queue is not empty:
    for i in ranks:
        out_queue[i] ⟵ empty
    for vertex v in my queue:
        if color(v) is white:
            color(v) ⟵ black
            for vertex w in neighbors(v):
                append w to out_queue[owner(w)]
    for i in ranks:
        start receiving in_queue[i] from rank i
    for j in ranks:
        start sending out_queue[j] to rank j
    synchronize and finish communications
(figure: out and in queues exchanged among rank 0, rank 1, and rank 2)

Active Messages (AM)
(figure: rank 0 sends messages to rank 1, where a message handler runs and may send a reply back to rank 0's reply handler)
Proposed by von Eicken et al. for Split-C in 1992. The sender explicitly sends a message; upon the message's arrival, a message handler is triggered at the receiver, and the receiver is not explicitly involved. This is a suitable paradigm for data-driven applications: data is sent immediately and communication is asynchronous.

Breadth-First Search in Active Messages
while queue is not empty:
    new_queue ⟵ empty
    begin AM epoch
    for vertex v in queue:
        for vertex w in neighbors(v):
            send AM to owner(w)
    end AM epoch
    queue ⟵ new_queue

AM_handler (vertex v):
    if color(v) is white:
        color(v) ⟵ black
        insert v to new_queue
(figure: ranks 0, 1, and 2 each run the AM handler on incoming vertices)

Motivation
Most applications are MPI-based, and we do not want to rewrite entire applications; an incremental approach to modifying them is needed. This paper tackles the challenge of supporting Active Messages within the MPI framework, so that applications can use traditional MPI communication and Active Messages simultaneously.

API DESIGN

Challenges
Active Messages should work correctly alongside other MPI messages: memory consistency semantics must be defined, as must the consistency between two different active messages. We leverage the MPI one-sided (RMA) interface, which already addresses these challenges.

MPI RMA Interface
"One-sided" communication: the origin process specifies all communication parameters, and the target process exposes a memory region (the "window") that other processes may access. There are three basic operations, MPI_Put, MPI_Get, and MPI_Accumulate; Accumulate performs simple updates on the target process.
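
As a point of reference, the following is a minimal sketch in C (not from the slides) of passive-target RMA: each process exposes an integer through a window, and rank 0 updates rank 1's copy with MPI_Accumulate while rank 1 is not explicitly involved.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    int buf;                                   /* memory exposed through the window */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = rank;
    MPI_Win_create(&buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 0) {
        int one = 1;
        /* the origin specifies all parameters; the target (rank 1) does nothing */
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        MPI_Accumulate(&one, 1, MPI_INT, 1, 0, 1, MPI_INT, MPI_SUM, win);
        MPI_Win_unlock(1, win);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 1)
        printf("rank 1 window value: %d\n", buf);   /* 1 + 1 = 2 */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}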

API Design
Extend MPI_Accumulate to support user-defined functions.
User-defined function: MPI_User_function (void *invec, void *inoutvec, int *len, MPI_Datatype *dtype)
Operation creation: MPI_Op_create (MPI_User_function *user_fn, int commute, MPI_Op *user_op)
Operation registration: MPIX_Op_register (MPI_Op user_op, int id, MPI_Win win), a collective call on the window

Example
func0 and func1 have the same functionality.

rank 0 (origin):
    void func0 (void *in, void *inout, int *len, MPI_Datatype *dtype);
    MPI_Win_create (…, &win);
    MPI_Op_create (func0, ..., &user_op);
    MPIX_Op_register (user_op, 0, win);
    MPI_Win_lock (…, 1, …, win);
    MPI_Accumulate (…, 1, …, user_op, win);
    MPI_Win_unlock (1, win);
    MPIX_Op_deregister (user_op, win);
    MPI_Win_free (&win);

rank 1 (target):
    void func1 (void *in, void *inout, int *len, MPI_Datatype *dtype);
    MPI_Win_create (…, &win);
    MPI_Op_create (func1, ..., &user_op);
    MPIX_Op_register (user_op, 0, win);
    /* func1 is triggered at rank 1 */
    MPIX_Op_deregister (user_op, win);
    MPI_Win_free (&win);
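
To make the user-defined function concrete, here is a hypothetical handler for the BFS example that matches the MPI_User_function signature; the names bfs_handler, colors, and queue_push are illustrative and not from the paper. It marks each received vertex and appends newly discovered vertices to the local queue.

#include <mpi.h>

#define WHITE 0
#define BLACK 1

extern int  colors[];          /* per-vertex visit state (illustrative) */
extern void queue_push(int v); /* appends v to the next-level queue (illustrative) */

void bfs_handler(void *invec, void *inoutvec, int *len, MPI_Datatype *dtype)
{
    int *vertices = (int *)invec;   /* vertices carried by the active message */
    (void)inoutvec;                 /* the target buffer is not used by this handler */
    (void)dtype;                    /* assumed to be MPI_INT */

    for (int i = 0; i < *len; i++) {
        int v = vertices[i];
        if (colors[v] == WHITE) {   /* first time this vertex is reached */
            colors[v] = BLACK;
            queue_push(v);          /* process it in the next BFS level */
        }
    }
}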

ASYNCHRONOUS PROCESSING

Asynchronous execution models
"NON-ASYNC": no asynchronous processing; messages are processed by a single thread. Disadvantage: messages cannot be processed upon arrival.
"THREAD-ASYNC": asynchronous processing is provided by a thread on top of the MPI library. Disadvantage: "active polling" for intra-node messages.
"INTEGRATED-ASYNC": asynchronous processing is supported internally by the MPI library.
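
A minimal sketch (an assumption for illustration, not the paper's implementation) of the THREAD-ASYNC model in C: a helper thread built on top of MPI repeatedly probes for incoming active-message requests and runs the handler, which is the "active polling" the slide refers to.

#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define AM_TAG 77

static volatile int shutdown_flag = 0;

static void *am_progress_thread(void *arg)
{
    MPI_Comm am_comm = *(MPI_Comm *)arg;
    while (!shutdown_flag) {
        int flag = 0;
        MPI_Status status;
        /* "active polling": repeatedly probe for incoming AM requests */
        MPI_Iprobe(MPI_ANY_SOURCE, AM_TAG, am_comm, &flag, &status);
        if (flag) {
            int v;
            MPI_Recv(&v, 1, MPI_INT, status.MPI_SOURCE, AM_TAG,
                     am_comm, MPI_STATUS_IGNORE);
            /* run the user handler on the received data */
            printf("AM handler invoked for vertex %d\n", v);
        }
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    /* the helper thread makes MPI calls, so MPI_THREAD_MULTIPLE is required */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    MPI_Comm am_comm;
    MPI_Comm_dup(MPI_COMM_WORLD, &am_comm);  /* keep AM traffic separate from MPI traffic */

    pthread_t tid;
    pthread_create(&tid, NULL, am_progress_thread, &am_comm);

    /* ... application computation and regular MPI communication ... */

    MPI_Barrier(MPI_COMM_WORLD);
    shutdown_flag = 1;
    pthread_join(tid, NULL);
    MPI_Comm_free(&am_comm);
    MPI_Finalize();
    return 0;
}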

Inter-node Communication
Thread-based approach: spawn a thread inside the TCP network module that block-polls for active messages, using separate sockets for MPI messages and for AM.
(figure: the main thread spawns the AM thread; MPI and AM traffic use separate sockets)
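
As a rough sketch (an assumption, not the actual MPICH TCP module code), the spawned AM thread can block in poll() on its dedicated socket instead of spinning:

#include <poll.h>
#include <unistd.h>

/* Loop run by the AM thread: block until data arrives on the dedicated AM socket. */
void am_socket_loop(int am_sockfd, volatile int *shutdown_flag)
{
    struct pollfd pfd = { .fd = am_sockfd, .events = POLLIN };
    char buf[4096];

    while (!*shutdown_flag) {
        /* blocking poll: no CPU is burned while waiting for active messages */
        int n = poll(&pfd, 1, 1000 /* ms, so the shutdown flag is re-checked */);
        if (n > 0 && (pfd.revents & POLLIN)) {
            ssize_t len = read(am_sockfd, buf, sizeof(buf));
            if (len > 0) {
                /* hand the payload to the registered AM handler here */
            }
        }
    }
}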

Intra-node Communication
"Origin computation": during window creation, processes on the same node allocate a shared-memory region. The origin process directly fetches the data from the target process, completes the computation locally, and writes the result back to the target process.
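
A minimal sketch of the same idea using the public MPI-3 shared-memory window interface (an assumption for illustration; the paper implements this inside the MPI library): on a node communicator, the origin maps the target's window memory, runs the "handler" locally, and writes the result back.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm node_comm;
    MPI_Win  win;
    int     *base, rank;

    MPI_Init(&argc, &argv);
    /* communicator containing only the processes on this node */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &rank);

    /* every process exposes one int in a node-wide shared-memory window */
    MPI_Win_allocate_shared(sizeof(int), sizeof(int), MPI_INFO_NULL,
                            node_comm, &base, &win);
    MPI_Win_lock_all(0, win);

    *base = rank;
    MPI_Win_sync(win);              /* make local stores visible */
    MPI_Barrier(node_comm);

    if (rank == 1) {
        MPI_Aint size;
        int disp_unit, *target_base;
        /* origin computation: map rank 0's window memory directly ... */
        MPI_Win_shared_query(win, 0, &size, &disp_unit, &target_base);
        /* ... fetch, apply the user "handler" locally, and write back */
        *target_base += 100;
        MPI_Win_sync(win);
    }
    MPI_Barrier(node_comm);
    MPI_Win_sync(win);

    if (rank == 0)
        printf("rank 0 window value after origin computation: %d\n", *base);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}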

Network Integrated Idea
(figure, animated across several slides: rank 0 (target) and rank 1 (origin) share a shared-memory region on NODE 0; rank 2 (origin) is on NODE 1)

PERFORMANCE

Stencil Benchmark
Number of processes: 2x2 to 20x20, with 8 processes per node ("Fusion" cluster at ANL: 320 nodes, 8 cores per node, QDR InfiniBand).

Graph500 Benchmark
Breadth-First Search (BFS): a large number of small messages is sent to random targets (TEPS: Traversed Edges Per Second, higher is better). Optimization: the origin process accumulates local data and sends only one message to each target process.
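
A minimal sketch of this coalescing optimization (an assumption; out_buf, out_count, user_op, and win are illustrative names): rather than issuing one accumulate per discovered vertex, each origin flushes one buffered accumulate per target.

#include <mpi.h>

/* out_buf[t] / out_count[t]: vertices buffered locally for target t */
void flush_out_queues(int nranks, int **out_buf, int *out_count,
                      MPI_Op user_op, MPI_Win win)
{
    for (int t = 0; t < nranks; t++) {
        if (out_count[t] == 0)
            continue;
        MPI_Win_lock(MPI_LOCK_SHARED, t, 0, win);
        /* one coalesced active message per target instead of one per vertex */
        MPI_Accumulate(out_buf[t], out_count[t], MPI_INT,
                       t, 0, out_count[t], MPI_INT, user_op, win);
        MPI_Win_unlock(t, win);
        out_count[t] = 0;
    }
}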

Conclusion
Why support Active Messages in MPI; leveraging the MPI RMA interface; API design; asynchronous processing for intra-node and inter-node messages; Stencil and Graph500 benchmarks.

QUESTIONS

BACKUP SLIDES

Message Passing Models
Two-sided communication: explicit sends and receives.
One-sided communication: explicit sends, implicit receives; simple operations on the remote process.
Active Messages: explicit sends, implicit receives; user-defined operations on the remote process.

Active Message Models
Libraries and runtime systems. Low level: GASNet, IBM's LAPI, DCMF and PAMI, AMMPI. High level: Charm++, X10. Middle: AM++.

MPI RMA Interface
Two synchronization modes:
Passive Target Mode (Lock-Unlock): the origin opens an epoch with Lock, issues operations such as PUT(Y) and ACC(X), and closes the epoch with Unlock; the target is not involved.
Active Target Mode (PSCW, Fence): the origin brackets its operations with Start and Complete, while the target brackets the exposure epoch with Post and Wait.
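
For reference, a minimal sketch in C (not from the slides) of the two modes, with rank 0 as origin and rank 1 as target; the fence variant of active-target synchronization is shown for brevity.

#include <mpi.h>

/* Passive target: only the origin (rank 0) makes synchronization calls. */
void passive_target_example(int rank, int *x, MPI_Win win)
{
    if (rank == 0) {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);    /* open epoch at target 1 */
        MPI_Put(x, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Accumulate(x, 1, MPI_INT, 1, 0, 1, MPI_INT, MPI_SUM, win);
        MPI_Win_unlock(1, win);                         /* close epoch; ops complete */
    }
}

/* Active target (fence variant): all processes participate in synchronization. */
void active_target_example(int rank, int *x, MPI_Win win)
{
    MPI_Win_fence(0, win);                              /* start access/exposure epoch */
    if (rank == 0)
        MPI_Put(x, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                              /* end epoch; ops complete */
}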

Inter-node Communication
Separate sockets are used for Active Messages.
(figures: the main thread and the AM thread with socket connections labeled origin_sc, rma_sc, rma_send_sc, rma_recv_sc, and rma_resp_sc)