Casper: An Asynchronous Progress Model for MPI RMA on Many-core Architectures

Min Si [1][2], Antonio J. Peña [1], Jeff Hammond [3], Pavan Balaji [1], Masamichi Takagi [4], Yutaka Ishikawa [4]

[1] Argonne National Laboratory, USA: {msi, apenya, balaji}@mcs.anl.gov
[2] University of Tokyo, Japan: msi@il.is.s.u-tokyo.ac.jp
[3] Intel Labs, USA: jeff_hammond@acm.org
[4] RIKEN AICS, Japan: {masamichi.takagi, yutaka.ishikawa}@riken.jp

Irregular Computations

Regular computations:
- Organized around dense vectors or matrices
- Regular data movement patterns; use MPI SEND/RECV or collectives
- More local computation, less data movement
- Examples: stencil computation, matrix multiplication, FFT*

Irregular computations:
- Organized around graphs and sparse vectors; more "data driven" in nature
- Data movement pattern is irregular and data-dependent
- Data movement grows much faster than computation

- Examples: social network analysis, bioinformatics

An increasing number of applications are moving to irregular computation models, which need a more dynamic communication model.

* FFT: Fast Fourier Transform
* The primary contents of this slide are contributed by Xin Zhao.

Message Passing Models

Two-sided communication: process 0 issues Send(data) and process 1 issues a matching Receive(data).

One-sided communication (Remote Memory Access): process 0 issues Put(data), Get(data), or Acc(data) against process 1's memory while process 1 keeps computing.

Features of one-sided communication:
- The origin (P0) specifies all communication parameters.
- The target (P1) does not explicitly receive or process the message.

Is communication always asynchronous? (A minimal one-sided example follows this slide.)
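To make the one-sided model concrete, here is a minimal, hedged example that is not taken from the presentation: every process exposes one double through a window, and the non-zero ranks atomically add to rank 0's value with MPI_Accumulate while rank 0 makes no matching receive call.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nproc;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        double buf = 0.0, one = 1.0;
        MPI_Win win;
        /* Every process exposes one double through the window. */
        MPI_Win_create(&buf, sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_lock_all(0, win);              /* passive-target epoch */
        if (rank != 0) {
            /* Origin-only call: atomically add 1.0 to the double exposed by rank 0. */
            MPI_Accumulate(&one, 1, MPI_DOUBLE, 0 /* target */, 0 /* disp */,
                           1, MPI_DOUBLE, MPI_SUM, win);
            MPI_Win_flush(0, win);             /* complete the operation at the target */
        }
        MPI_Win_unlock_all(win);
        MPI_Barrier(MPI_COMM_WORLD);           /* all updates are now complete */

        if (rank == 0) {
            /* Read back through the window so the example is valid in both memory models. */
            double result;
            MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
            MPI_Get(&result, 1, MPI_DOUBLE, 0, 0, 1, MPI_DOUBLE, win);
            MPI_Win_unlock(0, win);
            printf("accumulated value = %.1f (expected %d)\n", result, nproc - 1);
        }

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }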

Problems in Asynchronous Progress

One-sided operations are not truly one-sided. On most platforms (e.g., InfiniBand, Cray):
- Some operations are supported in hardware (e.g., contiguous PUT/GET).
- Other operations have to be done in software (e.g., 3D accumulates of double-precision data).

A software implementation of one-sided operations means that the target process has to make an MPI call to make progress: an Acc(data) issued by process 0 is delayed until process 1, busy with computation, enters the MPI library. Not TRULY asynchronous!

* RDMA: Remote Direct Memory Access

Traditional Approach of ASYNC Progress (1)

Thread-based approach:
- Every MPI process has a dedicated background communication thread.
- The background thread polls MPI progress in order to handle incoming messages for its process.
- Examples: the MPICH default asynchronous progress thread, SWAP-bioinformatics.

Cons:
- Wastes half of the computing cores, or oversubscribes cores.
- Overhead of MPI multithreading safety.

(A hedged sketch of this pattern follows this slide.)
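The following is a minimal, hedged sketch of the thread-based pattern, not MPICH's actual implementation: a background thread repeatedly enters the MPI progress engine so that software-handled RMA operations targeting this process can complete while the main thread computes. It assumes MPI_THREAD_MULTIPLE support, and the busy-polling loop is exactly what consumes (or oversubscribes) a core.

    #include <mpi.h>
    #include <pthread.h>

    static MPI_Comm progress_comm;
    static volatile int progress_done = 0;

    /* Background helper: any MPI call drives the progress engine, letting the
     * implementation service incoming (software-handled) RMA operations. */
    static void *progress_thread(void *arg) {
        (void)arg;
        while (!progress_done) {
            int flag;
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, progress_comm,
                       &flag, MPI_STATUS_IGNORE);   /* poll; result is ignored */
        }
        return NULL;
    }

    int main(int argc, char **argv) {
        int provided;
        /* Full thread safety is required because two threads call MPI concurrently. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_dup(MPI_COMM_WORLD, &progress_comm);

        pthread_t tid;
        pthread_create(&tid, NULL, progress_thread, NULL);

        /* ... application computation and RMA epochs run on the main thread ... */

        progress_done = 1;
        pthread_join(tid, NULL);
        MPI_Comm_free(&progress_comm);
        MPI_Finalize();
        return 0;
    }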

Traditional Approach of ASYNC Progress (2)

Interrupt-based approach:
- Assumes all hardware resources on the target processes are busy with user computation.
- Uses hardware interrupts to awaken a kernel thread that processes the incoming RMA messages.
- Examples: Cray MPI, IBM MPI on Blue Gene/P.

Cons:
- Overhead of frequent interrupts.

[Figure: DMAPP-based ASYNC overhead on Cray XC30]

Outline

- Background & problem statement
- Existing approaches
- Our solution: Casper
- Ensuring correctness and performance
- Evaluation

Casper: Process-based ASYNC Progress

Multi- and many-core architectures:
- Rapidly growing number of cores
- Not all cores are busy all the time

Process-based asynchronous progress:
- Dedicate an arbitrary number of cores to "ghost processes" (e.g., G0 and G1 alongside user processes P0 ... PN).
- Each ghost process intercepts all RMA operations directed at the user processes it serves, so the origin's Acc(data) is completed by the ghost process's MPI calls while the target keeps computing; with dedicated cores there is no need for interrupts.

Pros:
- No overhead from multithreading safety or frequent interrupts
- Flexible core deployment
- Portable PMPI* redirection

(An illustrative PMPI interception sketch follows this slide.)

* PMPI: the name-shifted profiling interface of MPI
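A hedged sketch of what PMPI redirection could look like; this is illustrative, not Casper's actual source. The library overrides MPI_Put, translates the user-visible target into the ghost process that exposes that target's memory, and issues the real operation through the name-shifted PMPI entry point. The helpers ghost_rank_of, ghost_offset_of, and internal_win_of are hypothetical placeholders for the library's internal bookkeeping.

    #include <mpi.h>

    /* Hypothetical bookkeeping helpers (placeholders, not real Casper APIs). */
    extern int      ghost_rank_of(int target_rank, MPI_Win user_win);
    extern MPI_Aint ghost_offset_of(int target_rank, MPI_Win user_win);
    extern MPI_Win  internal_win_of(MPI_Win user_win);

    /* Override the MPI entry point; applications link against this transparently. */
    int MPI_Put(const void *origin_addr, int origin_count,
                MPI_Datatype origin_datatype, int target_rank,
                MPI_Aint target_disp, int target_count,
                MPI_Datatype target_datatype, MPI_Win win)
    {
        /* Translate (target rank, displacement) into (ghost rank, shifted offset). */
        int ghost = ghost_rank_of(target_rank, win);
        MPI_Aint disp = ghost_offset_of(target_rank, win) + target_disp;

        /* Issue the real operation through the name-shifted PMPI interface. */
        return PMPI_Put(origin_addr, origin_count, origin_datatype,
                        ghost, disp, target_count, target_datatype,
                        internal_win_of(win));
    }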

Basic Design of Casper

Three primary functionalities:
1. Transparently replace MPI_COMM_WORLD with COMM_USER_WORLD, which contains only the user processes (the ghost processes are hidden from the application).
2. Map window memory between local user processes and ghost processes using MPI-3 MPI_Win_allocate_shared*, so a ghost process holds each local user process's exposed memory at a known offset.
3. Redirect RMA operations to the ghost processes: an ACC(P1, disp, user_win) issued inside Lock(P1) becomes ACC(G0, P1_offset + disp, internal_win) inside Lock(G0), where G0 is the ghost process serving P1, while P1 stays in its computation.

(A sketch of the shared-memory mapping step follows this slide.)

* MPI_WIN_ALLOCATE_SHARED: allocates a window that is shared among all processes in the window's group, usually on a communicator created with MPI_COMM_TYPE_SHARED.
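A hedged sketch of the shared-memory mapping step, under the assumption that user and ghost processes on the same node are grouped with MPI_COMM_TYPE_SHARED and that each process contributes its window memory to one shared segment; the details of Casper's actual allocation differ.

    #include <mpi.h>

    /* Build the node-local shared segment and learn every peer's base address. */
    void map_window_memory(MPI_Aint my_size, MPI_Comm comm_with_ghosts)
    {
        MPI_Comm node_comm;
        /* Group the user and ghost processes that share a node. */
        MPI_Comm_split_type(comm_with_ghosts, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);

        void *my_base;
        MPI_Win shm_win;
        /* Each process contributes 'my_size' bytes to a node-wide shared segment. */
        MPI_Win_allocate_shared(my_size, 1, MPI_INFO_NULL, node_comm,
                                &my_base, &shm_win);

        int nlocal;
        MPI_Comm_size(node_comm, &nlocal);
        for (int r = 0; r < nlocal; r++) {
            MPI_Aint rsize;
            int disp_unit;
            void *rbase;
            /* A ghost process can now load/store directly in any local user's
             * exposed memory and compute its offset in the internal RMA window. */
            MPI_Win_shared_query(shm_win, r, &rsize, &disp_unit, &rbase);
        }

        /* In a real design shm_win and node_comm live as long as the user window;
         * they are not freed here only to keep the sketch short. */
    }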

Ensuring Correctness and Performance

Casper sits between the applications and the MPI implementation (MPICH, Cray MPI, MVAPICH, ...) and must deliver asynchronous progress while preserving both correctness and performance.

Correctness challenges:
1. Lock permission management
2. Self lock consistency
3. Managing multiple ghost processes
4. Multiple simultaneous epochs

RMA Synchronization Modes

Active-target mode (both origin and target issue synchronization):
- Fence (like a global barrier): e.g., P0 issues a PUT to P1 between two Fence(win) calls made by all processes.
- PSCW (a subgroup version of Fence): the origins use Start/Complete and the target uses Post/Wait, e.g., Start(P1, win) ... Comp(P1, win) on the origins matching Post(P0 & P2, win) ... Wait(P0 & P2, win) on P1.

Passive-target mode (only the origin issues synchronization):
- Lock_all (shared): Lock_all(win) ... PUT ... Unlock_all(win).
- Lock (shared or exclusive): Lock(P1) ... PUT ... Unlock(P1).

(Minimal sketches of the two families follow this slide.)
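Two minimal, hedged sketches of these synchronization families, assuming win and the target rank were set up elsewhere; they only illustrate which side makes the synchronization calls.

    #include <mpi.h>

    /* Active target: every process in the window's group calls the fences. */
    void put_with_fence(double *buf, int target, MPI_Win win)
    {
        MPI_Win_fence(0, win);                                 /* open epoch   */
        MPI_Put(buf, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);                                 /* complete PUT */
    }

    /* Passive target: only the origin synchronizes; the target stays passive. */
    void put_with_lock(double *buf, int target, MPI_Win win)
    {
        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
        MPI_Put(buf, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
        MPI_Win_unlock(target, win);   /* target makes no matching MPI call */
    }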

[Correctness Challenge 1] Lock Permission Management for Shared Ghost Processes (1)

Case 1: two origins access two targets that share the same ghost process.
- [POOR PERF.] P2's Lock(P0, win) ... Unlock(P0, win) and P3's Lock(P1, win) ... Unlock(P1, win) both translate to Lock(G0, win) ... Unlock(G0, win), so the two concurrent lock epochs have to be serialized at the ghost process.

Case 2: one origin accesses two targets that share the same ghost process.
- [INCORRECT] Lock(P0, win) and Lock(P1, win) both translate to Lock(G0, win), i.e., nested locks to the same target. The MPI standard states that an origin cannot nest locks to the same target.

[Correctness Challenge 1] Lock Permission Management for Shared Ghost Processes (2)

Solution: N windows, where N is the maximum number of user processes on any node. Communication to the i-th user process on each node goes through the i-th window (WIN[0], WIN[1], ...).

Such overlapping windows allow Casper to carefully bypass the lock permission management in MPI when accessing different target processes, but still take advantage of it when accessing the same target. While this approach ensures correctness, it can be expensive both in resource usage and in performance.

User hint optimization: the window info key "epochs_used" ("fence|pscw|lock|lockall" by default). If "epochs_used" contains "lock", create N windows; otherwise, create only a single window.

[Correctness Challenge 2] Self Lock Consistency (1)

When P0 locks itself (Lock(P0, win); x=1; y=2; ...; Unlock(P0, win)), Casper translates the epoch into Lock(G0, win) ... Unlock(G0, win) on the ghost process.
- MPI standard: a local lock must be acquired immediately.
- MPI standard: a remote lock may be delayed.
Because the translated lock is a remote lock on G0, it may not yet be held when P0 performs its local stores, so self-lock consistency is broken.

[Correctness Challenge 2] Self Lock Consistency (2)

Solution (two steps):
1. Force-lock with HIDDEN BYTES*: Lock(G0, win); Get(G0, win); Flush(G0, win). After the flush, the lock is known to be acquired.
2. Lock self: Lock(P0, win), which also provides the memory barrier needed for memory consistency.

User hint optimizations:
- Window info "no_local_loadstore": neither step is needed.
- Epoch assert MPI_MODE_NOCHECK: only the second step is needed.

(A sketch of the force-lock step follows this slide.)

* The MPI standard defines an unnecessary restriction on concurrent GET and accumulate; see MPI Standard Version 3.0, page 456, line 39.
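A hedged sketch of the force-lock step described above; names such as hidden_byte_disp are assumptions. A dummy GET on a hidden byte followed by a flush guarantees that the lock is actually held at the ghost process before local load/store proceeds.

    #include <mpi.h>

    /* Force the shared lock on the ghost process to be truly acquired. */
    void force_lock_self(int ghost_rank, MPI_Aint hidden_byte_disp, MPI_Win internal_win)
    {
        char dummy;   /* the fetched hidden byte is never used */

        MPI_Win_lock(MPI_LOCK_SHARED, ghost_rank, 0, internal_win);
        /* Completing any operation proves the lock is held at the target. */
        MPI_Get(&dummy, 1, MPI_BYTE, ghost_rank, hidden_byte_disp,
                1, MPI_BYTE, internal_win);
        MPI_Win_flush(ghost_rank, internal_win);

        /* From here on, local load/store to the exposed memory is safe,
         * subject to the usual memory-consistency (Win_sync) rules. */
    }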

[Correctness Challenge 3] Managing Multiple Ghost Processes (1)

Lock permission among multiple ghost processes:
- [INCORRECT] Two EXCLUSIVE locks to the same target may be acquired concurrently. P2 and P3 each issue Lock(EXCLUSIVE, P0, win); PUT(P0); Unlock(P0, win), which must be serialized. With two ghost processes, each origin instead issues Lock(EXCLUSIVE, G0, win) and Lock(EXCLUSIVE, G1, win) and then picks a ghost at random for the PUT (e.g., P2 sends to G0 and P3 sends to G1).
- An MPI implementation is allowed to ignore a lock if the target does not receive any operation. Since the empty lock can be ignored, P2 and P3 may concurrently acquire the exclusive locks on G0 and G1.

[Correctness Challenge 3] Managing Multiple Ghost Processes (2)

Ordering and atomicity constraints for accumulate operations:
- MPI standard: accumulates from the same origin to the same target location must be ordered.
- MPI standard: concurrent accumulates to the same target location must be atomic per basic datatype element.
- [INCORRECT] Neither ordering nor atomicity can be maintained by MPI once the operations are spread across multiple ghost processes. For example, a GET_ACC(x, y, P1) followed by an ACC(x, P1) from the same origin may be applied out of order when they are routed to different ghost processes (the GET_ACC returns y=4 while the correct result is 2), and two concurrent ACC(x, P1) operations routed to different ghost processes may both read x=5 and apply their updates non-atomically.

[Correctness Challenge 3] Managing Multiple Ghost Processes (3)

Solution (two phases):

Static-binding phase:
- Rank binding model: each user process binds to a single ghost process.
- Segment binding model: the total exposed memory on each node is segmented into NG chunks, and each chunk binds to a single ghost process.
- RMA operations are redirected only to the bound ghost process. This fixes the lock and the ACC ordering & atomicity issues, but it is only suitable for balanced communication patterns.

Static-binding-free phase (an optimization for dynamic communication patterns):
- Once an operation plus a flush has been issued (e.g., Lock(G0, G1); PUT(G0); Flush(G0)), the "main lock" is known to be acquired, and subsequent operations may dynamically select the target ghost process ("binding free") until Unlock(G0, G1).
- Accumulate operations cannot be "binding free".

[Correctness Challenge 4] Multiple Simultaneous Epochs - Active Epochs (1)

Simultaneous fence epochs on disjoint sets of user processes that share the same ghost processes:
- [INCORRECT] Deadlock! One group of user processes runs Fence(win0) epochs while a disjoint group runs Fence(win1) epochs. The shared ghost processes G0 and G1 would have to participate in both fences at once, so each fence blocks waiting for a ghost process tied up by the other epoch.

[Correctness Challenge 4] Multiple Simultaneous Epochs - Active Epochs (2)

Solution: every user window has an internal "global window", and active-target synchronization is translated to passive-target mode.
- Fence: Lock_all(global win) is issued at Win_allocate and Unlock_all(global win) at Win_free; each user Fence becomes Flush_all(global win) + Barrier(COMM_USER_WORLD) + Win_sync, and the PUT(P) inside the epoch becomes PUT(G) to the ghost process.
- PSCW: translated to Flush + send-receive, performed on the user processes.

The translation introduces performance issues [Performance issues 1-3, detailed in the backup slides], which can be avoided with user hints such as MPI_MODE_NOPRECEDE (avoids the flush_all) and NOSTORE & NOPUT & NOPRECEDE (avoids the barrier).

(A sketch of the translated fence follows this slide.)
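A hedged sketch of the translated fence, under the assumption that Lock_all(global_win) is already held (opened at window allocation, as described above); the assert-based shortcuts are omitted.

    #include <mpi.h>

    /* Fence translated to passive-target synchronization on the always-open
     * "global window"; assumes MPI_Win_lock_all(0, global_win) was called at
     * window allocation time and stays open until window free. */
    void translated_fence(MPI_Win global_win, MPI_Comm comm_user_world)
    {
        /* Complete all operations this process issued as an origin ...        */
        MPI_Win_flush_all(global_win);
        /* ... wait until every other user process has completed its own ...   */
        MPI_Barrier(comm_user_world);
        /* ... and synchronize the public and private window copies so that
         *     local load/store after the fence sees remote updates.           */
        MPI_Win_sync(global_win);
    }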

Evaluation

- Asynchronous progress microbenchmark
- NWChem quantum chemistry application

Experimental environment: NERSC's newest supercomputer*, a Cray XC30.

* https://www.nersc.gov/users/computational-systems/edison/configuration/

Asynchronous Progress Microbenchmark

RMA implementation in Cray MPI v6.3.1:

  Mode             HW-handled OP       ASYNC. mode
  Original mode    None                Thread
  DMAPP mode       Contig. PUT/GET     Interrupt

Test scenario:

  Lock_all (win);
  for (dst = 0; dst < nproc; dst++) {
      OP(dst, double, cnt = 1, win);
      Flush(dst, win);
      busy wait 100us;   /* computing */
  }
  Unlock_all (win);

Results on Cray XC30:
- Accumulate (handled in software; DMAPP mode provides interrupt-based async): Casper provides asynchronous progress for the software-handled ACC.
- PUT (handled in hardware in DMAPP mode): Casper does not affect the performance of the hardware PUT.

(A compilable rendering of the test loop follows this slide.)
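For reference, a compilable rendering of the test loop above; it is a sketch under the assumption that the window exposes at least one double on every process, and timing and reporting are omitted.

    #include <mpi.h>

    static void busy_wait_us(double us) {
        double t0 = MPI_Wtime();
        while ((MPI_Wtime() - t0) * 1e6 < us)
            ;   /* spin to emulate computation */
    }

    void rma_loop(MPI_Win win, int nproc) {
        double val = 1.0;
        MPI_Win_lock_all(0, win);
        for (int dst = 0; dst < nproc; dst++) {
            /* One software-handled operation per target, e.g. an accumulate. */
            MPI_Accumulate(&val, 1, MPI_DOUBLE, dst, 0, 1, MPI_DOUBLE, MPI_SUM, win);
            MPI_Win_flush(dst, win);     /* wait for remote completion */
            busy_wait_us(100.0);         /* 100 us of "computation"    */
        }
        MPI_Win_unlock_all(win);
    }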

NWChem Quantum Chemistry Application (1)

- A computational chemistry application suite composed of many types of simulation capabilities; a real production application.
- Uses ARMCI-MPI, a portable implementation of Global Arrays over MPI RMA.
- We focus on the most commonly used coupled-cluster (CC) simulations, run on a C20 molecule.

Get-compute-update model (the DGEMM is performed in a local buffer):

  for i in I blocks:
    for j in J blocks:
      for k in K blocks:
        GET block a from A
        GET block b from B
        c += a * b        /* computing */
      end do
      ACC block c to C

(A hedged MPI RMA rendering of this loop follows this slide.)
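The following is a hedged MPI RMA rendering of the get-compute-update loop, not NWChem or ARMCI-MPI code: the block lookup helpers (block_rank, block_disp) and local_dgemm are hypothetical placeholders, and the distribution is simplified to square blocks addressed through one window per matrix.

    #include <mpi.h>
    #include <stdlib.h>

    /* Hypothetical helpers: locate a block of a distributed matrix. */
    extern int      block_rank(int block);
    extern MPI_Aint block_disp(int block);
    extern void     local_dgemm(const double *a, const double *b, double *c, int n);

    void block_contraction(MPI_Win win_a, MPI_Win win_b, MPI_Win win_c,
                           int nblocks, int bsize)
    {
        int n = bsize * bsize;
        double *a = malloc(n * sizeof *a);
        double *b = malloc(n * sizeof *b);
        double *c = malloc(n * sizeof *c);

        MPI_Win_lock_all(0, win_a);
        MPI_Win_lock_all(0, win_b);
        MPI_Win_lock_all(0, win_c);

        for (int i = 0; i < nblocks; i++)
            for (int j = 0; j < nblocks; j++) {
                for (int e = 0; e < n; e++) c[e] = 0.0;
                for (int k = 0; k < nblocks; k++) {
                    /* GET blocks a and b from wherever they live. */
                    MPI_Get(a, n, MPI_DOUBLE, block_rank(i * nblocks + k),
                            block_disp(i * nblocks + k), n, MPI_DOUBLE, win_a);
                    MPI_Get(b, n, MPI_DOUBLE, block_rank(k * nblocks + j),
                            block_disp(k * nblocks + j), n, MPI_DOUBLE, win_b);
                    MPI_Win_flush(block_rank(i * nblocks + k), win_a);
                    MPI_Win_flush(block_rank(k * nblocks + j), win_b);
                    local_dgemm(a, b, c, bsize);      /* c += a * b (computing) */
                }
                /* ACC the result block back into the distributed C matrix. */
                MPI_Accumulate(c, n, MPI_DOUBLE, block_rank(i * nblocks + j),
                               block_disp(i * nblocks + j), n, MPI_DOUBLE,
                               MPI_SUM, win_c);
                MPI_Win_flush(block_rank(i * nblocks + j), win_c);
            }

        MPI_Win_unlock_all(win_c);
        MPI_Win_unlock_all(win_b);
        MPI_Win_unlock_all(win_a);
        free(a); free(b); free(c);
    }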

Evaluation 2. NWChem Quantum Chemistry Application (2)

- Input data file: tce_c20_triplet (carbon C20).
- Platform configuration: 12-core Intel "Ivy Bridge" sockets, 24 cores per node.

CCSD iteration in the CCSD task: Casper's asynchronous progress helps CCSD performance.

Core deployment per node:

  Configuration                    # COMP.   # ASYNC.
  Original MPI                     24        0
  Casper                           20        4
  Thread-ASYNC (oversubscribed)    24        24
  Thread-ASYNC (dedicated)         12        12

(T) portion of the CCSD(T) task: more compute-intensive than CCSD, so the improvement is larger.

Summary

MPI RMA communication is not truly one-sided:
- Asynchronous progress is still needed.
- Thread- and interrupt-based approaches add overhead.

Multi-/many-core architectures:
- The number of cores is growing rapidly, and some cores are not always busy.

Casper: a process-based asynchronous progress model
- Dedicates an arbitrary number of cores to ghost processes.
- Maps window regions from user processes to ghost processes.
- Redirects all RMA synchronization and operations to ghost processes.
- Links to various MPI implementations through transparent PMPI redirection.

Download slides: http://sudalab.is.s.u-tokyo.ac.jp/~msi/pdf/ipdps2015-casper-slides.pdf

Backup slides

[Correctness Challenge 4] Multiple Simultaneous Epochs - Lock_all (1)

Lock_all-only windows use the same translation as Fence: lock_all in win_allocate, flush_all in unlock_all.

- [INCORRECT] A Lock_all and an EXCLUSIVE lock on the same window may be acquired concurrently. P2's Lock_all(win); PUT(P0); Unlock_all(win) and P3's Lock(EXCLUSIVE, P0, win); PUT(P0); Unlock(P0, win) must be serialized, but after translation P2 holds the shared lock_all on the global window while P3 holds Lock(EXCLUSIVE, G0, win[0]), so the two locks may be acquired concurrently.

[Correctness Challenge 4] Multiple Simultaneous Epochs - Lock_all (2)

Solution: translate Lock_all into a series of shared locks to all ghost processes. For example, P2's Lock_all(win) becomes Lock(SHARED, G0, win[0]), Lock(SHARED, G0, win[1]), Lock(SHARED, G1, win[0]), Lock(SHARED, G1, win[1]), ..., covering the locks for targets P0, P1, P2, and P3.

(A sketch of this translation follows this slide.)
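A hedged sketch of that translation, assuming the overlapping internal windows win[0..nwins-1] from challenge 1 and a per-node list of ghost ranks; the bookkeeping arrays are assumptions, not Casper's actual data structures.

    #include <mpi.h>

    /* Replace one Lock_all with shared locks on every ghost process in every
     * overlapping internal window, so that all user targets are covered. */
    void lock_all_as_per_target_locks(MPI_Win *internal_wins, int nwins,
                                      const int *ghost_ranks, int nghosts)
    {
        for (int w = 0; w < nwins; w++)
            for (int g = 0; g < nghosts; g++)
                MPI_Win_lock(MPI_LOCK_SHARED, ghost_ranks[g], 0, internal_wins[w]);
    }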

[Performance Challenge] Memory Locality

Casper internally detects the location of the user processes and binds each of them only to the closest ghost process. For example, on a two-socket node, P0-P2 on socket 0 are bound to G0 and P3-P5 on socket 1 are bound to G1, instead of an arbitrary assignment across sockets.

Performance Issues and Optimizations in Fence

The translation replaces Win_allocate; Fence(win0); PUT(P); ...; Fence(win) with Win_allocate; Lock_all(global win); ...; Flush_all(global win) + MPI_Barrier + Win_sync; PUT(G); ...

- [Performance issue 1] Fence only requires local completion of the operations issued by this process (as an origin), whereas Flush_all + MPI_Barrier enforces remote completion of all operations issued by all processes. The user hint MPI_MODE_NOPRECEDE avoids the flush_all.
- [Performance issue 2] Synchronization between processes is always needed to order a remote update before the fence with a local read after it, and a local write before the fence with a remote get after it (e.g., a PUT of y before the fence and a read of y after it; a write of x before the fence and a Get of x after it). The user hints NOSTORE & NOPUT & NOPRECEDE avoid the barrier.
- [Performance issue 3] Memory consistency management (the Win_sync).

Evaluation. Asynchronous Progress Microbenchmark

One-sided communication on an InfiniBand cluster, where contiguous PUT/GET are handled in hardware. [Figures: Accumulate (SW) and PUT (HW).]