1. Enabling MPI Interoperability Through Flexible Communication Endpoints
James Dinan, Pavan Balaji, David Goodell, Douglas Miller, Marc Snir, and Rajeev Thakur
2. Mapping of Ranks to Processes in MPI
[Figure: conventional communicator, with one rank per process and several threads (T) per process]
- MPI provides a 1-to-1 mapping of ranks to processes
- This was good in the past, but usage models have evolved
  - Programmers use many-to-one mapping of threads to processes
    - E.g. hybrid parallel programming with OpenMP/threads
  - Other programming models also use many-to-one mapping
    - Interoperability is a key objective, e.g. with Charm++, etc.
3. Current Approaches to Hybrid MPI+Threads
- MPI message matching space: <communicator, sender, tag>
- Two approaches to using THREAD_MULTIPLE
  1. Match specific thread using the tag:
     - Partition the tag space to address individual threads
     - Limitations:
       - Collectives: multiple threads at a process can’t participate concurrently
       - Wildcards: concurrent use by multiple threads requires care
  2. Match specific thread using the communicator:
     - Split threads across different communicators (e.g. Dup and assign)
     - Can use wildcards and collectives
     - However, limits connectivity of threads with each other
- Endpoints effectively add another component (thread ID) to the match
  - Addresses limitations of current approaches
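The tag-partitioning approach on this slide can be sketched in plain C, without any MPI calls. The bit widths and the helper names (make_tag, tag_thread, tag_user) are hypothetical choices for illustration. The only MPI fact assumed is that the standard guarantees MPI_TAG_UB is at least 32767, so 15 tag bits are always available.

```c
#include <assert.h>

/* Hypothetical layout: the low 10 bits carry the application tag and the
 * next 5 bits carry the destination thread ID. The 15 bits in total fit
 * under the minimum MPI_TAG_UB (32767) guaranteed by the MPI standard. */
enum { USER_TAG_BITS = 10, THREAD_BITS = 5 };
enum { MAX_USER_TAG  = (1 << USER_TAG_BITS) - 1,
       MAX_THREAD_ID = (1 << THREAD_BITS) - 1 };

/* Combine a destination thread ID and an application tag into one tag. */
int make_tag(int thread_id, int user_tag)
{
    assert(thread_id >= 0 && thread_id <= MAX_THREAD_ID);
    assert(user_tag >= 0 && user_tag <= MAX_USER_TAG);
    return (thread_id << USER_TAG_BITS) | user_tag;
}

/* Recover the destination thread ID from a combined tag. */
int tag_thread(int tag) { return tag >> USER_TAG_BITS; }

/* Recover the application tag from a combined tag. */
int tag_user(int tag) { return tag & MAX_USER_TAG; }
```

Under this scheme a receiving thread t would post its receives with make_tag(t, user_tag). A receive posted with MPI_ANY_TAG would match messages meant for any thread, which is the wildcard limitation noted on the slide.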
4. Impact of Light Cores and Threads on Message Rate
- Figure shamelessly stolen from Brian Barrett, et al. [EuroMPI ‘13]
- Threads sharing a rank increase posted receive queue depth (x-axis)
- Solution: More ranks!
  - Adding more MPI processes fragments the node
  - Can’t do shared memory programming across the whole node
5. Endpoints: Flexible Mapping of Ranks to Processes
[Figure: endpoints communicator, with processes holding one or more ranks and threads (T) attached to those ranks]
- Provide a many-to-one mapping of ranks to processes
  - Allows threads to act as first-class participants in MPI operations
  - Improve programmability of MPI + node-level and MPI + system-level models
  - Potential for improving performance of hybrid MPI + X
- A rank represents a communication “endpoint”
  - Set of resources that supports the independent execution of MPI communications
- Note: Figure demonstrates many usages, some may impact performance
6. Impact on MPI Implementations
- Two implementation strategies
  1. Each rank is a distinct network endpoint
  2. Ranks are multiplexed on endpoints
     - Effectively adds destination rank to the matching criteria
     - Currently rank is not included, because there is one per process
  3. Combination of the above
- Potential to reduce threading overheads
  - Separate resources per thread
    - Rank can represent distinct network resources
    - Increase HFI/NIC concurrency
  - Separate software state per thread
    - Per-endpoint message queues/matching
    - Split up progress across threads, increase progress engine concurrency
  - Enable per-communicator threading levels
    - COMM_WORLD = THREAD_MULTIPLE, my_comm = THREAD_FUNNELED
[Figure: one process holding three ranks, each with its own thread (T)]
7. The Endpoints Programming Interface
- Interface choices impact performance and usability
- Key parameter, creation of endpoints:
  - Static interface
    - Endpoints fixed for entire execution
    - Pro: Allows simpler implementation
    - Con: Interface is restrictive, not usable with libraries
    - Proposed for, but not included in MPI 3.0
  - Dynamic interface
    - Additional endpoints can be added dynamically
    - Pro: More expressive interface
    - Con: Implementation is not as simple
    - Proposed for MPI <next>
- Association of endpoints with threads
  - Explicit attach/detach or implicit
  - Goal: Avoid dependence on particular threading packages
8. Static Endpoint Creation
[Figure: MPI_COMM_ENDPOINTS and MPI_COMM_WORLD over two processes, with their ranks and threads (T)]
- MPI_COMM_ENDPOINTS defined statically
  - New MPI_INIT_ENDPOINTS function
  - “mpiexec --num_ep XX”, requires calling Init for each EP, OOB num_ep
  - E.g. for (ep = 0; ep < my_num_ep; ep++) MPI_Init();
- Allows simple resource management
  - Creation/freeing/mapping of network endpoints at startup/exit
- Interface is inflexible
  - Not easy for libraries and apps to both use static endpoints
9. Dynamic Endpoint Creation
[Figure: MPI_COMM_WORLD and my_ep_comm over two processes, with their ranks and threads (T)]
- Endpoints communicator is created dynamically
  - Through new MPI_COMM_CREATE_ENDPOINTS operation
- More expressive interface
  - Allows libraries and apps equal access to endpoints
- Dynamic resource management
  - Endpoints are added/removed dynamically
  - More sophisticated implementation required (Option #2 or #3)
10. Representation of Endpoints (Static/Dynamic)
- One handle: MPI_COMM_EP / my_ep_comm
  - Single communicator handle given to parent process
  - How to identify desired endpoints in MPI calls?
    - Threads/processes must attach/detach prior to making an MPI call
    - Endpoint I am using is cached in per-thread state
  - Requires MPI to use thread-local storage (TLS)
    - Adds a TLS lookup on the critical path for every operation
- N handles: MPI_COMM_EP[MY_EP] / my_ep_comm[MY_EP]
  - Multiple communicator handles, one per endpoint
  - Attach/detach is not needed (but could be helpful)
  - MPI does not need to use TLS
  - Improves interoperability with threading packages
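The difference between the two representations can be illustrated with a small C11 sketch. All names here (ep_handle_t, ep_attach, ep_detach, and the resolve functions) are hypothetical and only model how a library might find the endpoint for a call; they are not part of the proposal.

```c
#include <assert.h>

/* Hypothetical stand-in for a communicator handle; not an MPI type. */
typedef struct {
    int comm_id;
    int ep_index;  /* meaningful only in the N-handles design */
} ep_handle_t;

/* One-handle design: the endpoint in use is cached in per-thread state. */
static _Thread_local int attached_ep = -1;

void ep_attach(int ep_index) { attached_ep = ep_index; }
void ep_detach(void)         { attached_ep = -1; }

/* One-handle design: every operation on the shared handle has to read
 * thread-local state to learn which endpoint the caller is using. */
int resolve_ep_one_handle(const ep_handle_t *shared)
{
    (void)shared;  /* the handle alone does not identify the endpoint */
    assert(attached_ep >= 0);  /* the thread must have attached first */
    return attached_ep;
}

/* N-handles design: the handle itself names the endpoint, so no TLS
 * lookup and no attach/detach step is required. */
int resolve_ep_n_handles(const ep_handle_t *per_ep)
{
    return per_ep->ep_index;
}
```

In the N-handles variant the caller passes the right handle explicitly, so the sketch does not depend on how the threading package implements thread-local storage.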
11. Putting It All Together: Proposed Interface
int MPI_Comm_create_endpoints(MPI_Comm parent_comm, int my_num_ep, MPI_Info info, MPI_Comm *out_comm_hdls)
- Each rank in parent_comm gets my_num_ep ranks in out_comm
  - my_num_ep can be different at each process
  - Rank order: process 0’s ranks, process 1’s ranks, etc.
- Output is an array of communicator handles
  - ith handle corresponds to ith endpoint created by parent process
  - To use that endpoint, use the corresponding handle
[Figure: ranks labeled 1, 2 and 1, 2, 3, 4]
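The rank ordering rule (process 0's ranks, then process 1's ranks, and so on) amounts to an exclusive prefix sum over the my_num_ep values. The sketch below computes it locally from an array holding every process's my_num_ep. The function names and the idea of having all values in one array are assumptions made for illustration, and ranks are numbered from 0 as in MPI.

```c
#include <assert.h>

/* First rank in the endpoints communicator owned by `parent_rank`,
 * given the my_num_ep value contributed by every rank of parent_comm. */
int first_endpoint_rank(const int *num_ep, int parent_size, int parent_rank)
{
    assert(parent_rank >= 0 && parent_rank < parent_size);
    int first = 0;
    for (int p = 0; p < parent_rank; p++)
        first += num_ep[p];  /* exclusive prefix sum */
    return first;
}

/* Total number of ranks in the endpoints communicator: the sum of
 * my_num_ep over all ranks of parent_comm. */
int endpoints_comm_size(const int *num_ep, int parent_size)
{
    int total = 0;
    for (int p = 0; p < parent_size; p++)
        total += num_ep[p];
    return total;
}
```

For example, with my_num_ep values {2, 3, 1}, parent rank 1 owns endpoint ranks 2, 3, and 4, and its ith handle corresponds to rank 2 + i. A distributed code would typically obtain the same offset with a collective such as MPI_Exscan on parent_comm rather than from a local array.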
12. Collectives and Endpoints
- Endpoints have exactly the same semantics as MPI processes
  - Collective routines must be called by all ranks in the communicator concurrently
  - MPI_THREAD_MULTIPLE required for collectives to be used with endpoints
- Exception: Freeing the communicator
  - Want to avoid requiring MPI_THREAD_MULTIPLE
  - Allow usages where endpoints are used with MPI_THREAD_FUNNELED
  - The implementation must allow a single thread to free the communicator by calling MPI_COMM_FREE once per endpoint
[Figure: ranks labeled 1, 2 and 1, 2, 3, 4]
13. Usage Models are Many…
- Intranode parallel programming with MPI
  - Spawn endpoints off MPI_COMM_SELF
- Allow true thread multiple, with each thread addressable
  - Spawn endpoints off MPI_COMM_WORLD
- Obtain better performance
  - Partition threads into groups and assign a rank to each group
  - Performance benefits without partitioning shared memory programming model
- Interoperability
  - Examples: OpenMP and UPC
14. Enabling OpenMP Threads in MPI Collectives
- Hybrid MPI+OpenMP code
- Endpoints are used to enable OpenMP threads to fully utilize MPI
15. Enabling UPC+MPI Interoperability: User Code
- UPC runtime may be using threads within the node
- UPC compiler substitutes its own world communicator for MPI_COMM_WORLD
  - Can use the PMPI interface, if needed
- Compiler generates MPI calls needed to give a rank to each UPC thread
17. Flexible Computation Mapping
[Figure: three MPI processes, with COMM_WORLD, work_comm, and balanced_comm showing how ranks are distributed across them]
- Ranks correspond to work units, e.g., mesh tiles
- Data exchange between work units maps to communication between ranks
- Periodic load balancing redistributes work (i.e. ranks)
- Communication is preserved, because it follows the ranks
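The idea that communication follows the ranks can be modeled with a small ownership table. This is only an illustrative sketch: rank_map_t, process_of, and migrate are hypothetical names, and the slide itself does not say how the rebalanced communicator is built.

```c
#include <assert.h>

enum { NUM_WORK_UNITS = 6 };

/* owner[r] is the process currently hosting work unit (rank) r. */
typedef struct {
    int owner[NUM_WORK_UNITS];
} rank_map_t;

/* Resolve which process hosts a destination rank. */
int process_of(const rank_map_t *map, int rank)
{
    assert(rank >= 0 && rank < NUM_WORK_UNITS);
    return map->owner[rank];
}

/* Load balancing step: move a work unit, and with it its rank,
 * to another process. */
void migrate(rank_map_t *map, int rank, int new_process)
{
    assert(rank >= 0 && rank < NUM_WORK_UNITS);
    map->owner[rank] = new_process;
}
```

A neighbor that exchanges data with work unit 3 keeps addressing rank 3 before and after a call to migrate. Only the table entry changes, so the application's communication pattern stays the same.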
18. Thank you and Acknowledgements
- We thank the many members of the MPI community and MPI forum who contributed to this work!
- Review the formal proposal: https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/380
- Send comments to MPI Forum’s hybrid working group or
- Disclaimer: This presentation represents the views of the authors, and does not necessarily represent the views of Intel.
20. Endpoints Proposal, Text Part 1
This function creates a new communicator from an existing communicator, parent_comm, where my_num_ep ranks in the output communicator are associated with a single calling rank in parent_comm. This function is collective on parent_comm. Distinct handles for each associated rank in the output communicator are returned in the new_comm_hdls array at the corresponding rank in parent_comm. Ranks associated with a process in parent_comm are numbered contiguously in the output communicator, and the starting rank is defined by the order of the associated rank in the parent communicator.

If parent_comm is an intracommunicator, this function returns a new intracommunicator new_comm with a communication group of size equal to the sum of the values of my_num_ep on all calling processes. No cached information propagates from parent_comm to new_comm.

Each process in parent_comm must call MPI_COMM_CREATE_ENDPOINTS with a my_num_ep argument that ranges from 0 to the value of the MPI_COMM_MAX_ENDPOINTS attribute on parent_comm. Each process may specify a different value for the my_num_ep argument. When my_num_ep is 0, no output communicator is returned.

If parent_comm is an intercommunicator, then the output communicator is also an intercommunicator where the local group consists of endpoint ranks associated with ranks in the local group of parent_comm and the remote group consists of endpoint ranks associated with ranks in the remote group of parent_comm. If either the local or remote group is empty, MPI_COMM_NULL is returned in all entries of new_comm_hdls.
21. Endpoints Proposal, Text Part 2
Ranks in new_comm behave as MPI processes. For example, a collective function on new_comm must be called concurrently on every rank in this communicator. An exception to this rule is made for MPI_COMM_FREE, which must be called for every rank in new_comm, but must permit a single thread to perform these calls serially.

Rationale: The concurrency exception for MPI_COMM_FREE is made to enable MPI_COMM_CREATE_ENDPOINTS to be used when the MPI library has not been initialized with MPI_THREAD_MULTIPLE, or when the threading package cannot satisfy the concurrency requirement for collective operations.

Advice to Users: Although threads can acquire individual ranks through the MPI_COMM_CREATE_ENDPOINTS function, they still share an instance of the MPI library. Users must ensure that the threading level with which MPI was initialized is maintained. Some operations, such as collective operations, cannot be used by multiple threads sharing an instance of the MPI library unless MPI was initialized with MPI_THREAD_MULTIPLE.

Proposed New Error Classes
MPI_ERR_ENDPOINTS -- The requested number of endpoints could not be provided.

Proposed New Info Keys
same_num_ep -- All processes will provide the same my_num_ep argument to MPI_COMM_CREATE_ENDPOINTS.