Non-Collective Communicator Creation in MPI James Dinan 1, Sriram Krishnamoorthy 2, Pavan Balaji 1, Jeff Hammond 1, Manojkumar Krishnan 2, Vinod Tipparaju.

Slides:

Advertisements

Similar presentations

Enabling MPI Interoperability Through Flexible Communication Endpoints

Advertisements

Parallel Jacobi Algorithm Steven Dong Applied Mathematics.

Practical techniques & Examples

Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.

Class CS 775/875, Spring 2011 Amit H. Kumar, OCCS Old Dominion University.

Group-Collective Communicator Creation Ticket #286 Non-Collective Communicator Creation in MPI. Dinan, et al., Euro MPI ‘11.

Toward Efficient Support for Multithreaded MPI Communication Pavan Balaji 1, Darius Buntinas 1, David Goodell 1, William Gropp 2, and Rajeev Thakur 1 1.

GridRPC Sources / Credits: IRISA/IFSIC IRISA/INRIA Thierry Priol et. al papers.

Communication Analysis of Parallel 3D FFT for Flat Cartesian Meshes on Large Blue Gene Systems A. Chan, P. Balaji, W. Gropp, R. Thakur Math. and Computer.

An Evaluation of a Framework for the Dynamic Load Balancing of Highly Adaptive and Irregular Parallel Applications Kevin J. Barker, Nikos P. Chrisochoides.

Introduction to Parallel Rendering: Sorting, Chromium, and MPI Mengxia Zhu Spring 2006.

Tile Reduction: the first step towards tile aware parallelization in OpenMP Ge Gan Department of Electrical and Computer Engineering Univ. of Delaware.

Runtime Support for Irregular Computations in MPI-Based Applications - CCGrid 2015 Doctoral Symposium - Xin Zhao *, Pavan Balaji † (Co-advisor), William.

Techniques for Enabling Highly Efficient Message Passing on Many-Core Architectures Min Si PhD student at University of Tokyo, Tokyo, Japan Advisor : Yutaka.

A Load Balancing Framework for Adaptive and Asynchronous Applications Kevin Barker, Andrey Chernikov, Nikos Chrisochoides,Keshav Pingali ; IEEE TRANSACTIONS.

1 ProActive performance evaluation with NAS benchmarks and optimization of OO SPMD Brian AmedroVladimir Bodnartchouk.

User-Level Process towards Exascale Systems Akio Shimada [1], Atsushi Hori [1], Yutaka Ishikawa [1], Pavan Balaji [2] [1] RIKEN AICS, [2] Argonne National.

ADLB Update Recent and Current Adventures with the Asynchronous Dynamic Load Balancing Library Rusty Lusk Mathematics and Computer Science Division Argonne.

Optimizing Threaded MPI Execution on SMP Clusters Hong Tang and Tao Yang Department of Computer Science University of California, Santa Barbara.

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear.

The Asynchronous Dynamic Load-Balancing Library Rusty Lusk, Steve Pieper, Ralph Butler, Anthony Chan Mathematics and Computer Science Division Nuclear.

LLNL-PRES-XXXXXX This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

Designing and Evaluating Parallel Programs Anda Iamnitchi Federated Distributed Systems Fall 2006 Textbook (on line): Designing and Building Parallel Programs.

Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications Jonathan Lifflander, UIUC Sriram Krishnamoorthy, PNNL* Laxmikant.

Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang.

Selective Recovery From Failures In A Task Parallel Programming Model James Dinan*, Sriram Krishnamoorthy #, Arjun Singri*, P. Sadayappan* *The Ohio State.

STRATEGIC NAMING: MULTI-THREADED ALGORITHM (Ch 27, Cormen et al.) Parallelization Four types of computing: –Instruction (single, multiple) per clock cycle.

CCA Common Component Architecture Manoj Krishnan Pacific Northwest National Laboratory MCMD Programming and Implementation Issues.

Dynamic Time Variant Connection Management for PGAS Models on InfiniBand Abhinav Vishnu 1, Manoj Krishnan 1 and Pavan Balaji 2 1 Pacific Northwest National.

The Global View Resilience Model Approach GVR (Global View for Resilience) Exploits a global-view data model, which enables irregular, adaptive algorithms.

© 2010 IBM Corporation Enabling Concurrent Multithreaded MPI Communication on Multicore Petascale Systems Gabor Dozsa 1, Sameer Kumar 1, Pavan Balaji 2,

Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations Xin Huo, Vignesh T. Ravi, Gagan Agrawal Department of Computer Science and Engineering.

MRPGA ： An Extension of MapReduce for Parallelizing Genetic Algorithm Reporter ：古乃卉.

Argonne National Laboratory is a U.S. Department of Energy laboratory managed by U Chicago Argonne, LLC. Xin Zhao *, Pavan Balaji † (Co-advisor) and William.

An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura,

Abdelhalim Amer *, Huiwei Lu *, Pavan Balaji *, Satoshi Matsuoka + *Argonne National Laboratory, IL, USA +Tokyo Institute of Technology, Tokyo, Japan Characterizing.

StreamX10: A Stream Programming Framework on X10 Haitao Wei School of Computer Science at Huazhong University of Sci&Tech.

Ohio State Univ Effective Automatic Parallelization of Stencil Computations * Sriram Krishnamoorthy 1 Muthu Baskaran 1, Uday Bondhugula 1, Atanas Rountev.

A Fault Tolerant Protocol for Massively Parallel Machines Sayantan Chakravorty Laxmikant Kale University of Illinois, Urbana-Champaign.

Copyright: Abhinav Vishnu Fault-Tolerant Communication Runtime Support for Data-Centric Programming Models Abhinav Vishnu 1, Huub Van Dam 1, Bert De Jong.

High-Level, One-Sided Models on MPI: A Case Study with Global Arrays and NWChem James Dinan, Pavan Balaji, Jeff R. Hammond (ANL); Sriram Krishnamoorthy.

Non-Data-Communication Overheads in MPI: Analysis on Blue Gene/P P. Balaji, A. Chan, W. Gropp, R. Thakur, E. Lusk Argonne National Laboratory University.

PMI: A Scalable Process- Management Interface for Extreme-Scale Systems Pavan Balaji, Darius Buntinas, David Goodell, William Gropp, Jayesh Krishna, Ewing.

Scaling NWChem with Efficient and Portable Asynchronous Communication in MPI RMA Min Si [1][2], Antonio J. Peña [1], Jeff Hammond [3], Pavan Balaji [1],

Mehmet Can Kurt, The Ohio State University Sriram Krishnamoorthy, Pacific Northwest National Laboratory Kunal Agrawal, Washington University in St. Louis.

Non-Collective Communicator Creation Tickets #286 and #305 int MPI_Comm_create_group(MPI_Comm comm, MPI_Group group, int tag, MPI_Comm *newcomm) Non-Collective.

Multilevel Parallelism using Processor Groups Bruce Palmer Jarek Nieplocha, Manoj Kumar Krishnan, Vinod Tipparaju Pacific Northwest National Laboratory.

© 2009 IBM Corporation Parallel Programming with X10/APGAS IBM UPC and X10 teams  Through languages –Asynchronous Co-Array Fortran –extension of CAF with.

HPC Components for CCA Manoj Krishnan and Jarek Nieplocha Computational Sciences and Mathematics Division Pacific Northwest National Laboratory.

Efficient Multithreaded Context ID Allocation in MPI James Dinan, David Goodell, William Gropp, Rajeev Thakur, and Pavan Balaji.

CS 484 Designing Parallel Algorithms Designing a parallel algorithm is not easy. There is no recipe or magical ingredient Except creativity We can benefit.

Department of Computer Science MapReduce for the Cell B. E. Architecture Marc de Kruijf University of Wisconsin−Madison Advised by Professor Sankaralingam.

Design Issues of Prefetching Strategies for Heterogeneous Software DSM Author :Ssu-Hsuan Lu, Chien-Lung Chou, Kuang-Jui Wang, Hsiao-Hsi Wang, and Kuan-Ching.

Finding concurrency Jakub Yaghob. Finding concurrency design space Starting point for design of a parallel solution Analysis The patterns will help identify.

Data Structures and Algorithms in Parallel Computing Lecture 7.

Endpoints Plenary James Dinan Hybrid Working Group December 10, 2013.

CCA Common Component Architecture Distributed Array Component based on Global Arrays Manoj Krishnan, Jarek Nieplocha High Performance Computing Group Pacific.

1 HPJAVA I.K.UJJWAL 07M11A1217 Dept. of Information Technology B.S.I.T.

Non-Collective Communicator Creation Ticket #286 int MPI_Comm_create_group(MPI_Comm comm, MPI_Group group, int tag, MPI_Comm *newcomm) Non-Collective Communicator.

OpenMP Runtime Extensions Many core Massively parallel environment Intel® Xeon Phi co-processor Blue Gene/Q MPI Internal Parallelism Optimizing MPI Implementation.

Is MPI still part of the solution ? George Bosilca Innovative Computing Laboratory Electrical Engineering and Computer Science Department University of.

Global Trees: A Framework for Linked Data Structures on Distributed Memory Parallel Systems D. Brian Larkins, James Dinan, Sriram Krishnamoorthy, Srinivasan.

Synergy.cs.vt.edu VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units Shucai Xiao 1, Pavan Balaji 2, Qian Zhu 3,

Massively Parallel Cosmological Simulations with ChaNGa Pritish Jetley, Filippo Gioachin, Celso Mendes, Laxmikant V. Kale and Thomas Quinn.

Parallel Density-based Hybrid Clustering

Parallel Nested Monte-Carlo Search

Department of Computer Science University of California, Santa Barbara

Outline System architecture Current work Experiments Next Steps

Department of Computer Science University of California, Santa Barbara

Support for Adaptivity in ARMCI Using Migratable Objects

Presentation transcript:

Non-Collective Communicator Creation in MPI James Dinan 1, Sriram Krishnamoorthy 2, Pavan Balaji 1, Jeff Hammond 1, Manojkumar Krishnan 2, Vinod Tipparaju 2, and Abhinav Vishnu 2 1 Argonne National Laboratory 2 Pacific Northwest National Laboratory

Outline 1.Non-collective communicator creation –Result of Global Arrays/ARMCI on MPI one-sided work –GA/ARMCI flexible process groups Believed impossible to support on MPI 2.Case study: MCMC Load Balancing –Dynamical nucleation theory Monte Carlo application –Malleable multi-level parallel load balancing work –Collaborators: Humayun Arafat, P. Sadayappan (OSU), Sriram Krishnamoorthy (PNNL), Theresa Windus (ISU) 2

MPI Communicators  Core concept in MPI: –All communication is encapsulated within a communicator –Enables libraries that doesn’t interfere with application  Two types of communicators –Intracommunicator – Communicate within one group –Intercommunicator – Communicate between two groups  Communicator creation is collective –Believed that “non-collective” creation can’t be supported by MPI 3

Non-Collective Communicator Creation  Create a communicator collectively only on new members  Global Arrays process groups –Past: collectives using MPI Send/Recv  Overhead reduction –Multi-level parallelism –Small communicators when parent is large  Recovery from failures –Not all ranks in parent can participate  Load balancing 4 

Intercommunicator Creation  Intercommunicator creation parameters –Local comm – All ranks participate –Peer comm – Communicator used to identify remote leader –Local leader– Local rank that is in both local and peer comms –Remote leader– “Parent” communicator containing peer group leader –Safe tag – Communicate safely on peer comm 5 Intracommunicator Intercommunicator Local Remote

Non-Collective Communicator Creation Algorithm 6 MPI_COMM_SELF (intracommunicator) MPI_Intercomm_create (intercommunicator) MPI_Intercomm_merge ✓

Non-Collective Algorithm in Detail RR 7 Translate group ranks to ordered list of ranks on parent communicator Calculate my group ID Left group Right group

Experimental Evaluation  IBM Blue Gene/P at Argonne –Node: 4 Core, 850 MHz PowerPC 450 processor, 2GB memory –Rack: 1024 Nodes –40 Racks = 163,840 cores –IBM MPI, derived from MPICH2  Currently limited to two racks –Bug in MPI intercommunicators; still looking into it 8

Evaluation: Microbenchmark 9 O(log 2 g) O(log p)

Case Study: Markov Chain Monte Carlo  Dynamical nucleation theory Monte Carlo (DNTMC) –Part of NWChem –Markov chain Monte Carlo  Multiple levels of parallelism –Multi-node “Walker” groups –Walker: N random samples –Concurrent, sequential traversals  Load imbalance –Sample computation (energy calculation) time is irregular –Sample can be rejected 10 T L Windus et al 2008 J. Phys.: Conf. Ser

MCMC Load Balancing  Inner parallelism is malleable –Group computation scales as nodes added to walker group  Regroup idle processes with active groups  Preliminary results: –30% decrease in total application execution time 11

… … MCMC Benchmark Kernel  Simple MPI-only benchmark  “Walker” groups –Initial group size: G = 4  Simple work distribution –Samples: S = (leader % 32) * 10,000 –Sample time: t s = 100 ms / |G| –Weak scaling benchmark  Load balancing –Join right when idle –Asynchronous: Check for merge request after each work unit –Synchronous: Collectively regroup every i samples 12

Evaluation: Load Balancing Benchmark Results 13

Proposed MPI Extensions  Eliminate O(log g) factor by direct implementation 1.Multicommunicator –Multiple remote groups –Flatten into an intercommunicator 1.MPI_Group_comm_create(MPI_Comm in, MPI_Group grp, int tag, MPI_Comm *out) –Tag allows for safe communication and distinction of calls –Available in MPICH2 (MPIX), used by GA –Proposed for MPI-3 14 Local Remote

Conclusions  Algorithm for non-collective communicator creation –Recursive intercommunicator creation and merging –Benefits: Support for GA groups Lower overhead for small communicators Recovering from failures Load balancing (MCMC benchmark)  Cost is high, proposed extension to MPI –Multicommunicators –MPI_Group_comm_create(…) Contact: Jim Dinan -- 15

16 Additional Slides

Asynchronous Load Balancing … …

Collective Load Balancing (i = 2) … …

Number of Regrouping Operations on 8k Cores Load Balancing iAverageSandard Deviation MinMax None Asynch Collective Collective Collective Collective Collective Collective Collective Collective