Enabling MPI Interoperability Through Flexible Communication Endpoints

Presentation transcript:

Enabling MPI Interoperability Through Flexible Communication Endpoints
James Dinan, Pavan Balaji, David Goodell, Douglas Miller, Marc Snir, and Rajeev Thakur

Mapping of Ranks to Processes in MPI
[Figure: a conventional communicator maps one rank to each process; several threads (T) share that single rank.]
- MPI provides a 1-to-1 mapping of ranks to processes.
- This was adequate in the past, but usage models have evolved.
- Programmers now use a many-to-one mapping of threads to processes, e.g. hybrid parallel programming with OpenMP/threads.
- Other programming models also use a many-to-one mapping.
- Interoperability is a key objective, e.g. with Charm++.

Current Approaches to Hybrid MPI + Threads
- MPI message matching space: <communicator, sender, tag>.
- Two approaches to using MPI_THREAD_MULTIPLE:
  1. Match a specific thread using the tag: partition the tag space to address individual threads. Limitations: collectives (multiple threads at a process can't participate concurrently) and wildcards (using them from multiple threads concurrently requires care).
  2. Match a specific thread using the communicator: split threads across different communicators (e.g. duplicate the communicator and assign one per thread). This allows wildcards and collectives, but limits the connectivity of threads with each other.
- Endpoints effectively add another component (a thread ID) to the match, addressing the limitations of both current approaches. A sketch of the tag-partitioning approach follows.
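
As a concrete illustration of the first approach above, here is a minimal sketch of tag-space partitioning using only standard MPI-3 and OpenMP calls; the tag-packing macro and thread count are illustrative assumptions, not part of the slides:

    #include <mpi.h>
    #include <omp.h>

    /* Reserve the low bits of the tag for a thread ID so each thread at a
     * process can be addressed individually.  The packing scheme below is an
     * illustrative assumption, not something defined by the MPI standard. */
    #define THREAD_BITS 4
    #define MAKE_TAG(user_tag, tid) (((user_tag) << THREAD_BITS) | (tid))

    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Run with at least 2 processes. */
        #pragma omp parallel num_threads(4)
        {
            int tid = omp_get_thread_num();
            int buf = rank * 100 + tid;

            /* Thread tid on rank 0 sends to thread tid on rank 1; the
             * receiver selects the message meant for it via the tag. */
            if (rank == 0)
                MPI_Send(&buf, 1, MPI_INT, 1, MAKE_TAG(0, tid), MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(&buf, 1, MPI_INT, 0, MAKE_TAG(0, tid), MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }

Note that this scheme only disambiguates point-to-point traffic; as the slide states, it does not let multiple threads of one process participate in the same collective.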

Impact of Lightweight Cores and Threads on Message Rate
[Figure: message-rate results, "shamelessly stolen" from Brian Barrett et al., EuroMPI '13.]
- Threads sharing a rank increase the posted-receive-queue depth (the figure's x-axis), degrading message rate.
- Solution: more ranks. But adding more MPI processes fragments the node, so shared-memory programming across the whole node is no longer possible.

Endpoints: Flexible Mapping of Ranks to Processes
[Figure: an endpoints communicator in which each process holds several ranks, and threads (T) are assigned to individual ranks.]
- Provides a many-to-one mapping of ranks to processes.
- Allows threads to act as first-class participants in MPI operations.
- Improves programmability of MPI + node-level and MPI + system-level models.
- Potential for improving the performance of hybrid MPI + X.
- A rank represents a communication "endpoint": a set of resources that supports the independent execution of MPI communications.
- Note: the figure demonstrates many usages; some may impact performance.

Impact on MPI Implementations
- Implementation strategies:
  1. Each rank is a distinct network endpoint.
  2. Ranks are multiplexed on endpoints. This effectively adds the destination rank to the matching criteria; today the destination rank is not part of the match, because there is only one rank per process.
  3. A combination of the two.
- Potential to reduce threading overheads through separate resources per thread:
  - A rank can represent distinct network resources, increasing HFI/NIC concurrency.
  - Separate software state per thread: per-endpoint message queues and matching; progress split across threads, increasing progress-engine concurrency.
  - Per-communicator threading levels become possible, e.g. COMM_WORLD = THREAD_MULTIPLE while my_comm = THREAD_FUNNELED.
[Figure: a process with three ranks, each bound to a thread.]
A sketch of the extended matching tuple follows.
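
The following structs are a rough sketch of how the matching criteria grow under strategy #2; they are purely illustrative and do not correspond to any actual MPICH or Open MPI data structure:

    /* Illustrative only: a conventional match tuple versus one extended for
     * ranks multiplexed on a shared network endpoint. */
    typedef struct {
        int context_id;   /* communicator */
        int source_rank;  /* sender (possibly MPI_ANY_SOURCE) */
        int tag;          /* tag (possibly MPI_ANY_TAG) */
    } match_conventional_t;

    typedef struct {
        int context_id;
        int source_rank;
        int tag;
        int dest_rank;    /* NEW: which endpoint rank on this process is targeted */
    } match_endpoints_t;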

The Endpoints Programming Interface
- Interface choices impact performance and usability. The key design parameter is how endpoints are created.
- Static interface: endpoints are fixed for the entire execution.
  - Pro: allows a simpler implementation.
  - Con: the interface is restrictive and not usable with libraries.
  - Proposed for, but not included in, MPI 3.0.
- Dynamic interface: additional endpoints can be added dynamically.
  - Pro: more expressive interface.
  - Con: implementation is not as simple.
  - Proposed for MPI <next>.
- Association of endpoints with threads: explicit attach/detach or implicit. Goal: avoid dependence on particular threading packages.

Static Endpoint Creation
[Figure: MPI_COMM_WORLD has ranks 0-1 (one per process); MPI_COMM_ENDPOINTS has ranks 0-4, with threads bound to individual endpoint ranks.]
- MPI_COMM_ENDPOINTS is defined statically.
- Creation via a new MPI_INIT_ENDPOINTS function, or via "mpiexec --num_ep XX", which requires calling Init once per endpoint with num_ep learned out of band, e.g.:
    for (ep = 0; ep < my_num_ep; ep++) MPI_Init(NULL, NULL);
- Allows simple resource management: creation, freeing, and mapping of network endpoints at startup/exit.
- The interface is inflexible: it is not easy for libraries and applications to both use static endpoints.

Dynamic Endpoint Creation
[Figure: MPI_COMM_WORLD has ranks 0-1; my_ep_comm has ranks 0-4, with threads bound to individual endpoint ranks.]
- The endpoints communicator is created dynamically, through a new MPI_COMM_CREATE_ENDPOINTS operation.
- More expressive interface: libraries and applications get equal access to endpoints.
- Dynamic resource management: endpoints are added and removed dynamically.
- A more sophisticated implementation is required (strategy #2 or #3 from the implementation slide).

Representation of Endpoints (Static/Dynamic)
- One handle (MPI_COMM_EP / my_ep_comm): a single communicator handle is given to the parent process.
  - How do MPI calls identify the desired endpoint? Threads/processes must attach/detach before making an MPI call, and the endpoint in use is cached in per-thread state.
  - Requires MPI to use thread-local storage (TLS), adding a TLS lookup on the critical path of every operation.
- N handles (MPI_COMM_EP[MY_EP] / my_ep_comm[MY_EP]): multiple communicator handles, one per endpoint.
  - Attach/detach is not needed (but could still be helpful).
  - MPI does not need to use TLS, which improves interoperability with threading packages.

Putting It All Together: Proposed Interface

    int MPI_Comm_create_endpoints(
            MPI_Comm  parent_comm,
            int       my_num_ep,
            MPI_Info  info,
            MPI_Comm *out_comm_hdls[])

- Each rank in parent_comm gets my_num_ep ranks in the output communicator; my_num_ep can be different at each process.
- Rank order: process 0's ranks, then process 1's ranks, and so on.
- The output is an array of communicator handles; the ith handle corresponds to the ith endpoint created by the parent process. To use that endpoint, use the corresponding handle. A usage sketch follows.
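
A minimal usage sketch, assuming the proposed (not yet standardized) MPI_Comm_create_endpoints function described above; the fixed endpoint count is an illustrative choice:

    /* Sketch only: MPI_Comm_create_endpoints is a *proposed* interface.
     * With 2 processes each requesting MY_NUM_EP = 2 endpoints, the output
     * communicator has 4 ranks: ranks 0-1 live on process 0 and ranks 2-3
     * on process 1 (process 0's ranks first, then process 1's). */
    #define MY_NUM_EP 2

    MPI_Comm ep_comm[MY_NUM_EP];
    MPI_Comm_create_endpoints(MPI_COMM_WORLD, MY_NUM_EP, MPI_INFO_NULL, ep_comm);

    for (int i = 0; i < MY_NUM_EP; i++) {
        int ep_rank;
        MPI_Comm_rank(ep_comm[i], &ep_rank);  /* i-th handle == i-th local endpoint */
        /* hand ep_comm[i] to the thread that should own endpoint i */
    }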

Collectives and Endpoints
[Figure: endpoint ranks spread across two processes.]
- Endpoints have exactly the same semantics as MPI processes: collective routines must be called by all ranks in the communicator concurrently, so MPI_THREAD_MULTIPLE is required for collectives to be used with endpoints.
- Exception: freeing the communicator. To avoid requiring MPI_THREAD_MULTIPLE, and to allow usages where endpoints are combined with MPI_THREAD_FUNNELED, the implementation must allow a single thread to free the communicator by calling MPI_COMM_FREE once per endpoint, as sketched below.
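
The freeing exception amounts to the following fragment (a sketch assuming the proposed API; my_num_ep and ep_comm are the values from the creation call):

    /* Sketch: a single (e.g. master) thread releases all local endpoints by
     * calling MPI_Comm_free once per endpoint, serially.  This is the one
     * exception to the "all ranks call collectives concurrently" rule, so it
     * works even when MPI was initialized with MPI_THREAD_FUNNELED. */
    for (int i = 0; i < my_num_ep; i++)
        MPI_Comm_free(&ep_comm[i]);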

Usage Models are Many…
- Intranode parallel programming with MPI: spawn endpoints off MPI_COMM_SELF, allowing true THREAD_MULTIPLE usage with each thread individually addressable (see the sketch after this list).
- Spawn endpoints off MPI_COMM_WORLD to obtain better performance.
- Partition threads into groups and assign a rank to each group: performance benefits without partitioning the shared-memory programming model.
- Interoperability, for example with OpenMP and UPC.
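
For the intranode model, a short sketch (again assuming the proposed API): endpoints are spawned off MPI_COMM_SELF so the threads of one process can use MPI among themselves:

    /* Sketch (proposed API): one endpoint per OpenMP thread of this process,
     * derived from MPI_COMM_SELF, so threads can use MPI point-to-point and
     * collectives with each other without involving other processes. */
    int nthreads = omp_get_max_threads();
    MPI_Comm self_ep[nthreads];

    MPI_Comm_create_endpoints(MPI_COMM_SELF, nthreads, MPI_INFO_NULL, self_ep);

    #pragma omp parallel num_threads(nthreads)
    {
        int tid = omp_get_thread_num();
        int val = tid, sum;

        /* A process-local collective among all threads. */
        MPI_Allreduce(&val, &sum, 1, MPI_INT, MPI_SUM, self_ep[tid]);
    }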

Enabling OpenMP Threads in MPI Collectives
- Hybrid MPI + OpenMP code.
- Endpoints are used to enable OpenMP threads to fully utilize MPI.
(The slide's code listing is a figure; a reconstruction follows.)
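
The code figure is not reproduced in this transcript; the program below is a reconstruction in the spirit of the paper's OpenMP example, assuming the proposed MPI_Comm_create_endpoints API:

    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        int provided;
        int max_threads = omp_get_max_threads();
        MPI_Comm ep_comm[max_threads];

        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        #pragma omp parallel
        {
            int nt  = omp_get_num_threads();
            int tid = omp_get_thread_num();
            int ep_rank, val = tid, sum;

            /* One thread creates nt endpoints for this process (proposed API). */
            #pragma omp master
            MPI_Comm_create_endpoints(MPI_COMM_WORLD, nt, MPI_INFO_NULL, ep_comm);
            #pragma omp barrier

            /* Every thread now holds its own rank and can join the collective. */
            MPI_Comm_rank(ep_comm[tid], &ep_rank);
            MPI_Allreduce(&val, &sum, 1, MPI_INT, MPI_SUM, ep_comm[tid]);

            MPI_Comm_free(&ep_comm[tid]);
        }

        MPI_Finalize();
        return 0;
    }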

Enabling UPC+MPI Interoperability: User Code
- The UPC runtime may be using threads within the node.
- The UPC compiler substitutes its own world communicator for MPI_COMM_WORLD (it can use the PMPI interface, if needed).
- The compiler generates the MPI calls needed to give a rank to each UPC thread.

Enabling UPC+MPI Interoperability: Generated Code
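
The generated-code slide is a figure; the fragment below is a rough sketch of what a UPC compiler/runtime might emit, assuming the proposed API. The upcr_* names and THREADS_PER_PROCESS are hypothetical stand-ins, not real UPC runtime identifiers:

    /* Sketch only.  The runtime creates one endpoint per UPC thread of this
     * process and substitutes the resulting handle for MPI_COMM_WORLD in the
     * user's MPI calls (e.g. by intercepting them through PMPI). */
    #define THREADS_PER_PROCESS 4                    /* hypothetical */

    static MPI_Comm upcr_ep_world[THREADS_PER_PROCESS];

    int upcr_my_thread_within_process(void);         /* hypothetical runtime query */

    void upcr_init_mpi_interop(void)
    {
        /* Proposed API: one endpoint rank per UPC thread of this process. */
        MPI_Comm_create_endpoints(MPI_COMM_WORLD, THREADS_PER_PROCESS,
                                  MPI_INFO_NULL, upcr_ep_world);
    }

    /* Generated code: a user reference to MPI_COMM_WORLD is rewritten to the
     * calling UPC thread's endpoint handle. */
    #define UPCR_COMM_WORLD  upcr_ep_world[upcr_my_thread_within_process()]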

Flexible Computation Mapping
[Figure: three MPI processes; COMM_WORLD with one rank per process; work_comm with one rank per work unit distributed over the processes; balanced_comm showing the same work units after redistribution.]
- Ranks correspond to work units, e.g. mesh tiles.
- Data exchange between work units maps to communication between ranks.
- Periodic load balancing redistributes work (i.e. ranks); communication is preserved, because it follows the ranks.
A sketch of this pattern follows.
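
A sketch of the load-balancing pattern, assuming the proposed API; MAX_LOCAL_TILES and num_local_tiles() are hypothetical application-side names:

    /* Sketch: each work unit (e.g. a mesh tile) owns one endpoint rank, so a
     * process requests as many endpoints as it currently owns tiles.  After
     * load balancing changes the tile distribution, the endpoints communicator
     * is rebuilt with the new per-process counts, and inter-tile data exchange
     * remains expressed as communication between the corresponding ranks. */
    #define MAX_LOCAL_TILES 64

    MPI_Comm work_comm[MAX_LOCAL_TILES];

    int num_local_tiles(void);          /* hypothetical: tiles owned right now */

    void rebuild_work_comm(void)
    {
        int ntiles = num_local_tiles();
        MPI_Comm_create_endpoints(MPI_COMM_WORLD, ntiles,
                                  MPI_INFO_NULL, work_comm);
        /* local tile i now communicates through work_comm[i] */
    }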

Thank You and Acknowledgements
- We thank the many members of the MPI community and MPI Forum who contributed to this work!
- Review the formal proposal: https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/380
- Send comments to the MPI Forum's hybrid working group or james.dinan@intel.com
- Disclaimer: This presentation represents the views of the authors and does not necessarily represent the views of Intel.

Endpoints Proposal, Prototype

C binding:

    int MPI_Comm_create_endpoints(MPI_Comm parent_comm, int my_num_ep,
                                  MPI_Info info, MPI_Comm *out_comm_hdls[])

Fortran 2008 binding:

    MPI_Comm_create_endpoints(parent_comm, my_num_ep, info, out_comm_hdls, ierror) BIND(C)
        TYPE(MPI_Comm), INTENT(IN) :: parent_comm
        INTEGER, INTENT(IN) :: my_num_ep
        TYPE(MPI_Info), INTENT(IN) :: info
        TYPE(MPI_Comm), INTENT(OUT) :: out_comm_hdls(my_num_ep)
        INTEGER, OPTIONAL, INTENT(OUT) :: ierror

Fortran binding:

    MPI_COMM_CREATE_ENDPOINTS(PARENT_COMM, MY_NUM_EP, INFO, OUT_COMM_HDLS, IERROR)
        INTEGER PARENT_COMM, MY_NUM_EP, INFO, OUT_COMM_HDLS(*), IERROR

Endpoints Proposal, Text Part 1

This function creates a new communicator from an existing communicator, parent_comm, where my_num_ep ranks in the output communicator are associated with a single calling rank in parent_comm. This function is collective on parent_comm. Distinct handles for each associated rank in the output communicator are returned in the new_comm_hdls array at the corresponding rank in parent_comm. Ranks associated with a process in parent_comm are numbered contiguously in the output communicator, and the starting rank is defined by the order of the associated rank in the parent communicator.

If parent_comm is an intracommunicator, this function returns a new intracommunicator new_comm with a communication group of size equal to the sum of the values of my_num_ep on all calling processes. No cached information propagates from parent_comm to new_comm.

Each process in parent_comm must call MPI_COMM_CREATE_ENDPOINTS with a my_num_ep argument that ranges from 0 to the value of the MPI_COMM_MAX_ENDPOINTS attribute on parent_comm. Each process may specify a different value for the my_num_ep argument. When my_num_ep is 0, no output communicator is returned.

If parent_comm is an intercommunicator, then the output communicator is also an intercommunicator, where the local group consists of endpoint ranks associated with ranks in the local group of parent_comm and the remote group consists of endpoint ranks associated with ranks in the remote group of parent_comm. If either the local or remote group is empty, MPI_COMM_NULL is returned in all entries of new_comm_hdls.

Endpoints Proposal, Text Part 2

Ranks in new_comm behave as MPI processes. For example, a collective function on new_comm must be called concurrently on every rank in this communicator. An exception to this rule is made for MPI_COMM_FREE, which must be called for every rank in new_comm, but which must permit a single thread to perform these calls serially.

Rationale: The concurrency exception for MPI_COMM_FREE is made to enable MPI_COMM_CREATE_ENDPOINTS to be used when the MPI library has not been initialized with MPI_THREAD_MULTIPLE, or when the threading package cannot satisfy the concurrency requirement for collective operations.

Advice to Users: Although threads can acquire individual ranks through the MPI_COMM_CREATE_ENDPOINTS function, they still share an instance of the MPI library. Users must ensure that the threading level with which MPI was initialized is maintained. Some operations, such as collective operations, cannot be used by multiple threads sharing an instance of the MPI library unless MPI was initialized with MPI_THREAD_MULTIPLE.

Proposed New Error Classes:
MPI_ERR_ENDPOINTS -- The requested number of endpoints could not be provided.

Proposed New Info Keys:
same_num_ep -- All processes will provide the same my_num_ep argument to MPI_COMM_CREATE_ENDPOINTS.