A new thread support level for hybrid programming with MPI endpoints EASC 2015 Dan Holmes, Mark Bull, Jim Dinan

Exascale – hardware trends
Hardware design driven by power limits
Hardware increasingly has deep hierarchy
–Nodes -> CPUs -> cores -> SIMD vectors
–Nodes -> GPUs -> SMs -> cores -> warps
Hardware increasingly has varied locality
–Multiple levels of NUMA regions
–Multiple data paths: cache-coherency, RDMA
–Moving data from “far away” to “near” is costly

Exascale – hybrid programming
Unclear if pure MPI can scale to exascale
Exploration of MPI+X, i.e. hybrid programming
Many options for X (MPI, OpenMP, PGAS)
Most options for X involve OS threads
Interoperability between MPI and threads?
–Good: thread support defined in the MPI Standard
–Bad: multi-thread support has high overheads
–Ugly: addressability, i.e. identifying threads

An MPI process is a logical PE
MPI defines a flat hierarchy of processing elements: MPI processes
This is a programming-model concept
–An MPI process is defined only by MPI function calls and their semantics
It is not a programming-system construct
–c.f. an OS process or a POSIX thread
Increasingly, MPI processes are a bad model for complex hardware

Current thread support in MPI
Specified in the External Interfaces chapter
–Threads & OS processes are external to MPI
–Focused on implications for MPI library writers
Four thread support levels:
–MPI_THREAD_SINGLE: only one thread will exist
–MPI_THREAD_FUNNELED: only one thread will access MPI
–MPI_THREAD_SERIALIZED: only one thread will access MPI at a time
–MPI_THREAD_MULTIPLE: no restrictions; any thread can access MPI at any time

Why have several levels?
Implementation methods for the MPI process concept impose practical considerations
–Multiple: MPI must be fully thread-safe; protection of shared state “requires locks”
–Serialised: MPI does not need to protect shared state from concurrent accesses
–Funnelled: MPI can use thread-local features
–Single: MPI is free to use code even if it is not thread-safe

Threading issues for users 1
Collectives: manual coding of hierarchy
–E.g. each OpenMP thread provides a reduction value
–All threads want to call MPI_ALLREDUCE
–But only one thread is allowed in an MPI collective
–So first, do an OpenMP parallel reduction (adds thread-synchronisation overhead)
–Then, the MPI reduction in an OpenMP single region (adds thread load-imbalance overhead)

Threading issues for users 2
Point-to-point: addressing each thread requires different tags or communicators
–Communicators: requires too many for full addressability or general connectivity
–Tags: no way for MPI to know the relationship between tags and threads => shared state
Single shared unexpected-send queue
Single shared unmatched-receive queue
Access must be serialised, by the user or by MPI

Summary of MPI endpoints
Additional logical PEs – MPI “processes”
–Hierarchically associated with an MPI process
–Addressable by rank in the group of a new communicator
–Act and react like normal MPI processes, except for MPI_INIT_THREAD & MPI_FINALIZE
–Array of communicator handles returned by the communicator creation function; each returns a different MPI_COMM_RANK value

Modelling advantages
Processing elements that share an address space can be modelled in the application code using MPI endpoints
Flexible; threads communicate via MPI

  #pragma omp parallel
  {
    MPI_Comm_create_endpoints(numThreads);
    #pragma omp for
    for (...) {
      do_calc();
      MPI_Neighbor_alltoall(...);
    }
  }

New Optimisations Possible
Shared state can be divided per endpoint
–Example provided by the proxy-job demonstrator
Dedicated resources for each endpoint
–Separate queues per endpoint
–{communicator, target rank} identifies the context
–If only one thread ever uses an endpoint, then the endpoint usage matches the “funnelled” definition
–If only one thread at a time uses an endpoint, the endpoint usage matches the “serialised” definition

Application behaviour restriction
One MPI endpoint per OS thread is anticipated to be a common use case
How does the user inform MPI that the application will restrict its usage of threads & endpoints?
–A new INFO key supplied to the communicator creation function
–Specifying new thread support levels
–Enhancing existing thread support levels

Hinting at behaviour restriction
Current INFO keys are hints to the MPI library
–The MPI library can ignore all INFO keys
–The MPI library is not allowed to modify semantics based on INFO key information
–The user can lie! MPI must check hint accuracy
–MPI would still need to be initialised with MPI_THREAD_MULTIPLE
Considered but discounted

New thread support levels 1
Additional thread support levels (ideas that pre-date the endpoints proposal)
–MPI_THREAD_AS_RANK: each thread calls MPI_INIT and becomes an MPI process with its own rank in MPI_COMM_WORLD (like “single” but excludes thread-unsafe code)
–MPI_THREAD_FILTERED: some threads call MPI_INIT and become MPI processes; others delegate calling of MPI functions to an initialised thread (similar to “funnelled”)

New thread support levels 2
Additional thread support level to extend the definition of MPI_THREAD_FUNNELED
–MPI_THREAD_FILTERED
One thread calls MPI_INIT and is an MPI process
The MPI process creates multiple endpoints
All threads can call MPI at the same time if they all use different endpoints
Only one thread can use any particular endpoint

New thread support levels 3
Additional thread support level to extend the definition of MPI_THREAD_SERIALIZED
–MPI_THREAD_SERIAL_EP
One thread calls MPI_INIT and is an MPI process
The MPI process creates multiple endpoints
All threads can call MPI at the same time if they all use different endpoints
Any thread can use any endpoint, but only one thread can use any particular endpoint at a time

Alter old thread support levels
Add “per endpoint” wording to the funnelled and serialised definitions
–Backward compatible, because all existing MPI processes have exactly one MPI endpoint with no possibility to create more
–The intention is clear; precise wording is still under active discussion
–The funnelled definition is linked to the definition of “main thread”; should this now be “main threads”?

Suggested MPI Standard Text 1
MPI_THREAD_FUNNELED
–The process may be multi-threaded, but the application must ensure that only the main thread makes MPI calls (for the definition of main thread, see MPI_IS_THREAD_MAIN).
Revised definition of “main thread”:
–The thread that called MPI_INIT or MPI_INIT_THREAD, or that first uses a communicator handle returned by MPI_COMM_CREATE_ENDPOINTS

Suggested MPI Standard Text 2
MPI_THREAD_SERIALIZED
–The process may be multi-threaded, and multiple threads may make concurrent MPI calls, but only one at a time per endpoint: MPI calls using a single endpoint are not made concurrently from two distinct threads (all MPI calls for each endpoint are “serialized”).

Implementation Issues
MPI_COMM_CREATE_ENDPOINTS can generate MPI_ERR_ENDPOINTS
–if it cannot create/support the required number of endpoints (in the existing proposal)
For the new “funnelled” and “serialised” levels
–this will happen when MPI cannot provide isolated, dedicated resources for each new endpoint and cannot silently degrade to a “multiple”-like implementation

Summary
MPI endpoints introduce a hierarchy of logical PEs that share an address space
–Enables flexible new mappings of logical PEs (MPI processes/endpoints) to physical PEs (OS processes/threads)
Modifying the “funnelled” & “serialised” levels
–Extends their usefulness to more threads
–Delays the time when “multiple” is needed
–Backwards compatible