
1 Message Passing in Massively Multithreaded Environments Pavan Balaji Computer Scientist and Project Lead Argonne National Laboratory

2 Pavan Balaji, Argonne National Laboratory Many-core Architectures
 Massively parallel environment
 Intel® Xeon Phi co-processor
–60 cores inside a single chip, 240 hardware threads
–Self-hosting in the next generation; native mode in the current version
 Blue Gene/Q
–16 cores per node, 64 hardware threads
 Lots of "light-weight" cores per node is becoming a common model
INRIA/UIUC/Argonne Workshop (11/22/2013)

3 Pavan Balaji, Argonne National Laboratory Hybrid OpenMP + MPI Programming
 Multiple threads of a process share local resources
 Local computation is parallelized more efficiently with OpenMP
 MPI is used for communication between nodes
[Figure: several MPI processes, each with its own set of OpenMP threads, connected over the network]
INRIA/UIUC/Argonne Workshop (11/22/2013)

4 Pavan Balaji, Argonne National Laboratory Four Levels of MPI Thread Safety
 MPI_THREAD_SINGLE
–MPI only, no threads
 MPI_THREAD_FUNNELED
–MPI calls are made outside the OpenMP parallel region, or inside an OpenMP master region
 MPI_THREAD_SERIALIZED
–MPI calls are made outside the OpenMP parallel region, or inside an OpenMP single or critical region
 MPI_THREAD_MULTIPLE
–Any thread is allowed to make MPI calls at any time

Funneled:
#pragma omp parallel for
for (i = 0; i < N; i++) {
    uu[i] = (u[i] + u[i - 1] + u[i + 1]) / 5.0;
}
MPI_Function();

Serialized (single):
#pragma omp parallel
{
    /* user computation */
    #pragma omp single
    MPI_Function();
}

Serialized (critical):
#pragma omp parallel
{
    /* user computation */
    #pragma omp critical
    MPI_Function();
}
INRIA/UIUC/Argonne Workshop (11/22/2013)
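
As an illustration of how an application selects one of these levels (a minimal sketch, not taken from the slides): the desired thread-support level is requested at initialization with MPI_Init_thread, and the level actually provided must be checked.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request full multithreading; the library reports what it can provide. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE) {
        /* Fall back: only the funneled/serialized patterns shown above are legal. */
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available (got %d)\n", provided);
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d running with MPI_THREAD_MULTIPLE\n", rank);

    MPI_Finalize();
    return 0;
}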

5 Pavan Balaji, Argonne National Laboratory Problem: Idle Resources during MPI Calls
 Threads are only active in the computation phase
 Threads are IDLE during MPI calls

(a) Funneled mode:
#pragma omp parallel for
for (i = 0; i < N; i++) {
    uu[i] = (u[i] + u[i - 1] + u[i + 1]) / 5.0;
}
MPI_Function();   /* only the master thread is active */

(b) Serialized mode:
#pragma omp parallel
{
    /* user computation */
    #pragma omp single
    MPI_Function();   /* only one thread is active */
}
INRIA/UIUC/Argonne Workshop (11/22/2013)

6 Pavan Balaji, Argonne National Laboratory Our Approach: Sharing Idle Threads with the Application inside MPI

(a) Funneled mode:
#pragma omp parallel
{
    /* user computation */
}
MPI_Function()
{
    #pragma omp parallel
    {
        /* MPI internal task */
    }
}

(b) Serialized mode:
#pragma omp parallel
{
    /* user computation */
    #pragma omp single
    MPI_Function()
    {
        #pragma omp parallel
        {
            /* MPI internal task */
        }
    }
}
INRIA/UIUC/Argonne Workshop (11/22/2013)
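
A minimal, self-contained sketch of the idea (illustrative only; the function name mpi_internal_pack is hypothetical and this is not the MPICH code): when the application calls MPI from a single region, the library opens a nested OpenMP parallel region so the threads that would otherwise idle can help with internal work.

#include <omp.h>

/* Hypothetical MPI-internal task: pack a strided buffer using the idle threads. */
static void mpi_internal_pack(double *dest, const double *src,
                              int count, int stride)
{
    /* Nested parallelism must be enabled (e.g., omp_set_nested(1) or
     * OMP_NESTED=true); otherwise this region runs with a single thread. */
    #pragma omp parallel for
    for (int i = 0; i < count; i++)
        dest[i] = src[i * stride];
}

void application(double *dest, const double *src, int count, int stride)
{
    #pragma omp parallel
    {
        /* user computation */

        #pragma omp single
        {
            /* Serialized-mode MPI call: inside it, the library spawns a
             * nested team to parallelize its internal task. */
            mpi_internal_pack(dest, src, count, stride);
        }
    }
}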

7 Pavan Balaji, Argonne National Laboratory Challenges
 Some parallel algorithms are not efficient with too few threads and need a tradeoff, but the number of available threads is UNKNOWN
 Nested parallelism
–Simply creates new Pthreads
–Offloads thread scheduling to the OS, which causes a thread OVERRUNNING issue

(a) Unknown number of IDLE threads:
#pragma omp parallel
{
    /* user computation */
    #pragma omp single
    MPI_Function();   /* how many threads are idle at this point? */
}

(b) Threads overrunning (the nested region creates N new Pthreads):
#pragma omp parallel
{
    #pragma omp single
    {
        #pragma omp parallel
        { … }
    }
}
INRIA/UIUC/Argonne Workshop (11/22/2013)

8 Pavan Balaji, Argonne National Laboratory OpenMP Runtime Extension 1
 New runtime query, omp_get_num_idle_threads(), so the nested team can be sized to the threads that are actually idle:

#pragma omp parallel
#pragma omp single
{
    #pragma omp parallel num_threads(omp_get_num_idle_threads())
    { … }
}
INRIA/UIUC/Argonne Workshop (11/22/2013)
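
omp_get_num_idle_threads() is a proposed runtime extension, not part of standard OpenMP. As a rough approximation with the standard API only (a sketch under that assumption, not equivalent to the extension in the general case), the size of the enclosing team visible inside the single region can be used to bound the nested team:

#include <omp.h>

void nested_work_sized_to_outer_team(void)
{
    #pragma omp parallel
    {
        /* user computation */

        #pragma omp single
        {
            /* Inside 'single', omp_get_num_threads() returns the size of the
             * enclosing team; all but the executing thread are waiting, so
             * (team size - 1) approximates the idle-thread count the proposed
             * omp_get_num_idle_threads() would report. */
            int idle = omp_get_num_threads() - 1;

            #pragma omp parallel num_threads(idle > 0 ? idle : 1)
            {
                /* MPI internal task */
            }
        }
    }
}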

9 Pavan Balaji, Argonne National Laboratory OpenMP Runtime Extension 2
 Waiting progress in the barrier
–Threads SPIN-LOOP until a timeout (KMP_BLOCKTIME) before blocking
–May cause OVERSUBSCRIPTION when the nested team starts
 Solution: force waiting threads to enter a passive wait mode inside MPI
–set_fast_yield (sched_yield)
–set_fast_sleep (pthread_cond_wait)

Runtime wait loop (simplified):
while (time < KMP_BLOCKTIME) {
    if (done) break;
    /* spin loop */
}
pthread_cond_wait(…);

Usage:
#pragma omp parallel
#pragma omp single
{
    set_fast_yield(1);
    #pragma omp parallel
    { … }
}
INRIA/UIUC/Argonne Workshop (11/22/2013)
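
set_fast_yield / set_fast_sleep are proposed extensions to the OpenMP runtime. With the stock Intel runtime, a broadly similar passive-wait effect can be approximated by setting the block time to zero, either via KMP_BLOCKTIME=0 / OMP_WAIT_POLICY=passive in the environment or, assuming the Intel-specific entry point kmp_set_blocktime() is available, in code (a hedged sketch, not the mechanism described on the slide):

#include <omp.h>

/* Intel-runtime extension; prototyped here in case the omp.h in use does not
 * expose it (assumption: Intel OpenMP runtime). */
extern void kmp_set_blocktime(int msec);

void mpi_call_with_passive_waiters(void)
{
    /* Zero block time: threads reaching a barrier yield/sleep immediately
     * instead of spinning, so the nested team below does not oversubscribe
     * the cores. */
    kmp_set_blocktime(0);

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp parallel
        {
            /* MPI internal task */
        }
    }
}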

10 Pavan Balaji, Argonne National Laboratory OpenMP Runtime Extension 2
 set_fast_sleep vs. set_fast_yield
–Test bed: Intel Xeon Phi cards (stepping B0, 61 cores)
[Figure: overhead (us) as a function of KMP_BLOCKTIME for set_fast_sleep and set_fast_yield]
INRIA/UIUC/Argonne Workshop (11/22/2013)

11 Pavan Balaji, Argonne National Laboratory Derived Datatype Packing Processing
 MPI_Pack / MPI_Unpack
 Communication using derived datatypes
–Transfer non-contiguous data
–Pack / unpack the data internally

#pragma omp parallel for
for (i = 0; i < count; i++) {
    dest[i] = src[i * stride];
}

[Figure: a strided vector layout described by blocklength, count, and stride]
INRIA/UIUC/Argonne Workshop (11/22/2013)
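
For concreteness, a minimal sketch (not from the slides) of the strided layout above expressed as an MPI vector datatype; MPI_Pack then performs internally the same strided copy that the parallel loop on the slide implements by hand.

#include <mpi.h>
#include <stdlib.h>

/* Pack 'count' doubles taken every 'stride' elements from 'src'. */
void pack_strided(const double *src, int count, int stride, MPI_Comm comm)
{
    MPI_Datatype vec;
    int pack_size, position = 0;
    char *packbuf;

    /* count blocks, 1 element per block, 'stride' elements between block starts */
    MPI_Type_vector(count, 1, stride, MPI_DOUBLE, &vec);
    MPI_Type_commit(&vec);

    MPI_Pack_size(1, vec, comm, &pack_size);
    packbuf = malloc(pack_size);

    /* Internally this is the strided copy the slide parallelizes with OpenMP. */
    MPI_Pack((void *)src, 1, vec, packbuf, pack_size, &position, comm);

    /* ... send packbuf, or use the datatype directly in a send call ... */

    free(packbuf);
    MPI_Type_free(&vec);
}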

12 Pavan Balaji, Argonne National Laboratory Prefetching Issue when the Compiler Vectorizes Non-contiguous Data

(a) Sequential implementation (not vectorized):
for (i = 0; i < count; i++) {
    *dest++ = *src;
    src += stride;
}

(b) Parallel implementation (vectorized):
#pragma omp parallel for
for (i = 0; i < count; i++) {
    dest[i] = src[i * stride];
}
INRIA/UIUC/Argonne Workshop (11/22/2013)

13 Pavan Balaji, Argonne National Laboratory Sequential Pipelining LMT
 Shared user-space buffer, divided into cells (Cell[0] … Cell[3])
 Pipelined copy on both the sender side and the receiver side
–Sender: get an EMPTY cell from the shared buffer, copy data into it, mark the cell FULL; then fill the next cell
–Receiver: get a FULL cell from the shared buffer, copy the data out, mark the cell EMPTY; then move to the next cell
INRIA/UIUC/Argonne Workshop (11/22/2013)
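
A minimal sketch of the cell protocol described above (illustrative only, not the nemesis LMT code; the cell count, cell size, and flag-based synchronization are assumptions):

#include <string.h>
#include <stdatomic.h>

#define NCELLS    4
#define CELL_SIZE (32 * 1024)

/* Shared-memory buffer visible to both sender and receiver. */
typedef struct {
    atomic_int full[NCELLS];                          /* 0 = EMPTY, 1 = FULL   */
    size_t     len[NCELLS];
    char       data[NCELLS][CELL_SIZE];
} shared_buf_t;

void sender(shared_buf_t *shm, const char *src, size_t total)
{
    size_t off = 0;
    int c = 0;
    while (off < total) {
        while (atomic_load(&shm->full[c])) ;          /* wait for an EMPTY cell */
        size_t n = total - off < CELL_SIZE ? total - off : CELL_SIZE;
        memcpy(shm->data[c], src + off, n);           /* copy data into it      */
        shm->len[c] = n;
        atomic_store(&shm->full[c], 1);               /* mark the cell FULL     */
        off += n;
        c = (c + 1) % NCELLS;                         /* then fill the next     */
    }
}

void receiver(shared_buf_t *shm, char *dst, size_t total)
{
    size_t off = 0;
    int c = 0;
    while (off < total) {
        while (!atomic_load(&shm->full[c])) ;         /* wait for a FULL cell   */
        memcpy(dst + off, shm->data[c], shm->len[c]); /* copy the data out      */
        off += shm->len[c];
        atomic_store(&shm->full[c], 0);               /* mark the cell EMPTY    */
        c = (c + 1) % NCELLS;                         /* then move to the next  */
    }
}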

14 Pavan Balaji, Argonne National Laboratory Parallel Pipelining LMT
 Grab as many available cells as possible at once
 Parallelize the large data movement across threads
[Figure: (a) sequential pipelining moves one cell at a time; (b) parallel pipelining lets sender and receiver threads copy several cells concurrently]
INRIA/UIUC/Argonne Workshop (11/22/2013)

15 Pavan Balaji, Argonne National Laboratory Sequential Pipelining vs. Parallelism
 Small data transfers ( < 128 KB )
–Thread synchronization overhead > parallel improvement
 Large data transfers, three strategies compared
–Sequential fine-grained pipelining
–Parallelism with only a few threads (worse)
–Parallelism with many threads (better)
[Figure: data flows from the sender buffer through the shared buffer to the receiver buffer]
INRIA/UIUC/Argonne Workshop (11/22/2013)

16 Pavan Balaji, Argonne National Laboratory InfiniBand Communication
 Structures
–IB context
–Protection Domain (PD)
–Queue Pair (QP, critical): 1 QP per connection
–Completion Queue (CQ, critical): shared by 1 or more QPs
 RDMA communication
–Post an RDMA operation to a QP
–Poll for the completion from the CQ
 The verbs interface internally supports multithreading
 OpenMP contention issue
[Figure: MPICH stack (ADI3 / CH3 / nemesis with SHM, TCP, IB netmods) and processes P0-P2 sharing the IB CTX, PD, QP, and CQ on the HCA]
INRIA/UIUC/Argonne Workshop (11/22/2013)
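
For reference, the post/poll pattern referred to above uses standard libibverbs calls; a minimal, hedged sketch (connection setup, memory registration, and error handling omitted; this is not the MPICH IB netmod code):

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post one RDMA write on a connected QP and wait for its completion on the CQ. */
int rdma_write_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                        struct ibv_mr *mr, void *local_buf, size_t len,
                        uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* generate a completion */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* Post the RDMA operation to the QP. */
    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Poll the completion from the CQ. */
    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);

    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}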

17 Pavan Balaji, Argonne National Laboratory Parallel InfiniBand Communication
 Two levels of parallel policies
–Parallelize operations across different IB CTXs
–Parallelize operations across different CQs / QPs within one IB CTX
[Figure: (a) parallelism over separate IB CTXs; (b) parallelism over separate CQs / QPs sharing one IB CTX]
INRIA/UIUC/Argonne Workshop (11/22/2013)

18 Pavan Balaji, Argonne National Laboratory Parallelizing InfiniBand Small Data Transfers
 Three parallelism experiments based on ib_write_bw:
1. 1 process per node, 32 IB CTXs per process, 1 QP + 1 CQ per IB CTX
2. 1 process per node, 1 IB CTX per process, 32 QPs + 32 CQs per IB CTX
3. 1 process per node, 1 IB CTX per process, 32 QPs + 1 shared CQ per IB CTX
 Test bed: Intel Xeon Phi cards (stepping B1, 60 cores), InfiniBand QDR; data size: 2 bytes
[Figure: bandwidth improvement vs. number of threads / processes for the three configurations]
INRIA/UIUC/Argonne Workshop (11/22/2013)

19 Pavan Balaji, Argonne National Laboratory
 Basic components of the IB netmod
–Local RDMA buffer pool
–Remote RDMA buffer pool
[Figure: P0 and P1, each with a connection object holding a QP and a shared CQ, a local RDMA buffer pool, and a remote RDMA buffer pool, under the ADI3 / CH3 / nemesis (SHM, TCP, IB) stack]
INRIA/UIUC/Argonne Workshop (11/22/2013)

20 Pavan Balaji, Argonne National Laboratory Eager Message Transfer in the IB Netmod
 When sending many small messages
–IB resources (QP, CQ, remote RDMA buffers) are limited
–Most of the messages are therefore enqueued into a SendQ
–All SendQ messages are sent out in the wait progress
 Major steps in the wait progress (the receiving and sending steps are parallelizable)
–Clean up issued requests
–Receiving: poll the RDMA buffer, copy received messages out
–Sending: copy outgoing messages from the user buffer, issue the RDMA operation
[Figure: P0 enqueues messages into the SendQ, sends some immediately, and drains the SendQ between "wait progress start" and "wait progress end"; P1 exchanges SYNC messages and receives the messages]
INRIA/UIUC/Argonne Workshop (11/22/2013)
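
A minimal, self-contained sketch of the SendQ handling described above (hypothetical structures; the RDMA post is left as a comment; this is not the MPICH nemesis IB netmod code):

#include <stddef.h>
#include <string.h>
#include <stdio.h>

#define SLOT_SIZE 1024

typedef struct send_req {
    struct send_req *next;
    const void      *user_buf;
    size_t           len;
    int              dest;            /* target connection id */
} send_req_t;

typedef struct {
    send_req_t *head, *tail;          /* the SendQ */
} sendq_t;

void sendq_enqueue(sendq_t *q, send_req_t *req)
{
    req->next = NULL;
    if (q->tail) q->tail->next = req; else q->head = req;
    q->tail = req;
}

/* Drain the SendQ during the wait progress: copy each queued message into an
 * RDMA slot and issue the write (the verbs call itself is only a comment). */
void wait_progress_send(sendq_t *q)
{
    char slot[SLOT_SIZE];             /* stand-in for a remote RDMA buffer slot */

    while (q->head) {
        send_req_t *req = q->head;
        q->head = req->next;
        if (!q->head) q->tail = NULL;

        size_t n = req->len < SLOT_SIZE ? req->len : SLOT_SIZE;
        memcpy(slot, req->user_buf, n);       /* copy from the user buffer      */
        /* ... here the netmod would post an RDMA write on req->dest's QP ...   */
        printf("issued %zu bytes to connection %d\n", n, req->dest);
    }
}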

21 Pavan Balaji, Argonne National Laboratory Parallel Eager Protocol in the IB Netmod
 Parallel policy
–Parallelize a large set of messages going to different connections
 Different QPs: sending processing
–Copy outgoing messages from the user buffer
–Issue the RDMA operation
 Different RDMA buffers: receiving processing
–Poll the RDMA buffer
–Copy received messages out
INRIA/UIUC/Argonne Workshop (11/22/2013)

22 Pavan Balaji, Argonne National Laboratory Example: One-sided Communication
 Feature
–A large number of small non-blocking RMA operations issued to many targets
–Completion of ALL operations is waited for at the second synchronization call (MPI_Win_fence)
 MPICH implementation
–All the operations are actually issued in the second synchronization call
[Figure: an origin process issues PUT 0-5 to two target processes between two MPI_Win_fence calls; two threads (Th 0, Th 1) could issue them in parallel]
INRIA/UIUC/Argonne Workshop (11/22/2013)
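
The pattern on this slide corresponds to the following standard MPI code (a minimal sketch mirroring the figure, not code from the talk): all the small MPI_Put operations are non-blocking, and only the second MPI_Win_fence guarantees their completion, which is exactly the batch of work the implementation can issue in parallel.

#include <mpi.h>

#define NOPS 1024

int main(int argc, char **argv)
{
    int rank, nprocs;
    double *base;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Every process exposes a window of NOPS doubles. */
    MPI_Win_allocate(NOPS * sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);

    double val = (double)rank;

    MPI_Win_fence(0, win);                    /* open the access epoch        */
    if (rank == 0) {                          /* origin: many small PUTs      */
        for (int target = 1; target < nprocs; target++)
            for (int i = 0; i < NOPS; i++)
                MPI_Put(&val, 1, MPI_DOUBLE, target, i, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);                    /* ALL PUTs complete here       */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}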

23 Pavan Balaji, Argonne National Laboratory Parallel One-sided Communication
 Challenges in parallelization, and how they are addressed
–Global SendQ → group messages by target
–Queue structure → store it in a ring array with head / tail pointers
–Netmod-internal messages (SYNC etc.) → enqueue them into a separate SendQ (SYNC-SendQ)
[Figure: per-target send arrays (0-3) with IN/OUT pointers on P0, plus a separate SYNC-SendQ]
INRIA/UIUC/Argonne Workshop (11/22/2013)

24 Pavan Balaji, Argonne National Laboratory Parallel One-sided Communication
 Optimizations
–Every operation was issued through a long critical section → issue all operations together
–A large number of requests was created → create only one request
[Figure: RMA operations (PUT) issued from CH3 through the critical section into the IB netmod SendQ, within the ADI3 / CH3 / nemesis (SHM, IB) stack]
INRIA/UIUC/Argonne Workshop (11/22/2013)

25 Pavan Balaji, Argonne National Laboratory Evaluation: Datatype-related Functions
 Parallel 2D matrix packing
–Fixed total area, varying X and Y dimensions
–Element type: double
[Figure: speedup vs. number of threads for several element counts on the Y dimension; the packed 2D matrix is illustrated by a 5x5 example with elements 0-24 laid out along X and Y]
INRIA/UIUC/Argonne Workshop (11/22/2013)

26 Pavan Balaji, Argonne National Laboratory Inter-node 2D Halo Exchange
[Figure: speedup vs. number of threads for 2 MPI processes on 2 nodes and for 9 MPI processes on 9 nodes (P0, P1)]
INRIA/UIUC/Argonne Workshop (11/22/2013)

27 Pavan Balaji, Argonne National Laboratory Intra-node Large Message Communication INRIA/UIUC/Argonne Workshop (11/22/2013)

28 Pavan Balaji, Argonne National Laboratory One-sided Operations and Low-Level Optimizations
 Micro-benchmarks
–One-to-All experiment with 9 processes: process 0 issues many MPI_Put operations to all the other processes
–All-to-All experiment with 9 processes: every process issues many MPI_Put operations to all the other processes
[Figure: improvement percentage vs. number of threads for the One-to-All and All-to-All experiments]
INRIA/UIUC/Argonne Workshop (11/22/2013)

29 Pavan Balaji, Argonne National Laboratory Graph500 Benchmark
 Kernels
–Graph construction
–Breadth-First Search (BFS), implemented with MPI_Accumulate operations
–Validation
[Figure: mean BFS time (sec) and harmonic-mean TEPS vs. number of threads; a small example graph with 9 vertices]
INRIA/UIUC/Argonne Workshop (11/22/2013)

30 Pavan Balaji, Argonne National Laboratory MPI_THREAD_MULTIPLE Optimizations
 Global
–Use a single global mutex, held from function entry to exit
–The existing implementation
 Brief Global
–Use a single global mutex, but reduce the size of the critical section as much as possible
 Per-Object
–Use one mutex per data object: lock data, not code sections
 Lock-Free
–Use no mutexes; use lock-free algorithms
INRIA/UIUC/Argonne Workshop (11/22/2013)
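
To illustrate the difference between the Global and Per-Object strategies, a schematic sketch with hypothetical names (put_with_global_lock, put_with_object_lock, op_queue_t); these are not the MPICH critical-section macros:

#include <pthread.h>

/* --- Global: one mutex held for the whole MPI call ----------------------- */
static pthread_mutex_t global_cs = PTHREAD_MUTEX_INITIALIZER;

void put_with_global_lock(void)
{
    pthread_mutex_lock(&global_cs);       /* held from function entry ...    */
    /* parse arguments, find the target, build the request, enqueue it ...   */
    pthread_mutex_unlock(&global_cs);     /* ... to function exit            */
}

/* --- Per-Object: lock the data object, not the code section -------------- */
typedef struct {
    pthread_mutex_t lock;                 /* protects only this queue        */
    void           *head, *tail;
} op_queue_t;

void put_with_object_lock(op_queue_t *q)
{
    /* argument parsing, request creation, etc. run without any lock */

    pthread_mutex_lock(&q->lock);         /* short critical section around   */
    /* enqueue the request on this connection's queue */
    pthread_mutex_unlock(&q->lock);       /* ... the shared object only      */
}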

31 Pavan Balaji, Argonne National Laboratory Several Optimizations Later…
 Reduction of lock granularities
 Thread-local pools to reduce sharing
 Per-object locks
 Some atomic operations (reference counts)
 But the performance scaling was still suboptimal
INRIA/UIUC/Argonne Workshop (11/22/2013)

32 Pavan Balaji, Argonne National Laboratory Lock Contention in a Fully-Threaded MPI Model
 Hybrid fully-threaded MPI
–Threads can make MPI calls concurrently
–Thread safety is necessary
 Different algorithms are needed for the multi-threaded mode
 Thread safety in MPI: critical sections can be protected
–by locks (pthread_mutex_t)
–using lock-free algorithms (atomic compare-and-swap)
 Possible CONTENTION: while one thread is inside the critical section of an MPI call, the other threads sleep or poll on the lock

MPI_Init_thread(…, MPI_THREAD_MULTIPLE, …);
#pragma omp parallel
{
    /* do work */
    MPI_Put();    /* each call enters and exits a critical section */
    /* do work */
}
INRIA/UIUC/Argonne Workshop (11/22/2013)

33 Pavan Balaji, Argonne National Laboratory Characterizing Lock Contention in Hybrid MPI Applications
 Contention metric: wasted polls
 Test scenarios
–Micro-benchmarks
–Graph500 benchmark
–HPC applications

#pragma omp parallel
{
    for (i = 0; i < NITER; i++) {
        MPI_Put();
        Delay(X);   /* delay X us */
    }
}

"When you have eliminated the impossible, whatever remains, however improbable, must be the truth." – Sherlock Holmes, The Sign of Four, Sir Arthur Conan Doyle
INRIA/UIUC/Argonne Workshop (11/22/2013)
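
One way to obtain a "wasted polls" count (a hedged sketch of the general idea, not the actual instrumentation used in this study): count how often a thread fails to acquire the critical-section lock before finally getting it.

#include <pthread.h>
#include <errno.h>
#include <stdatomic.h>

static pthread_mutex_t cs_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_ullong   wasted_polls;   /* contention metric */

void enter_cs_instrumented(void)
{
    /* Each failed acquisition attempt counts as one wasted poll. */
    while (pthread_mutex_trylock(&cs_lock) == EBUSY)
        atomic_fetch_add_explicit(&wasted_polls, 1, memory_order_relaxed);
}

void exit_cs_instrumented(void)
{
    pthread_mutex_unlock(&cs_lock);
}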

34 Pavan Balaji, Argonne National Laboratory Argobots: Integrated Computation and Data Movement with Lightweight Work Units
 Execution model
–Execution stream: a thread executed on a hardware processing element
–Work units: user-level threads or tasklets, each with a function pointer
 Memory model
–Memory domains: a memory-consistency call on a large domain also affects all the domains nested inside it
–Synchronization: explicit and implicit memory-consistency calls
–Network: PUT/GET and atomic operations
[Figure: load/store (LS0) and non-coherent load/store (NCLS) domains, consistency domains CD0 and CD1, work units scheduled on an execution stream, and PUT/GET between domains]
INRIA/UIUC/Argonne Workshop (11/22/2013)

35 Pavan Balaji, Argonne National Laboratory Argobots Ongoing Work: Fine-grained Context-aware Thread Library
 Two levels of threads
–Execution stream: a normal (OS) thread
–Work unit: a user-level thread or a tasklet with a function pointer
 Avoid unnecessary lock/unlock
–Case 1: switching execution to another work unit in the same execution stream requires no unlock
–Case 2: switching to another execution stream calls unlock
 Schedule work units in batch order
–Work units in the same execution stream are executed in batches
[Figure: work units w1, w3, w5 mapped onto execution streams e1 and e2]
INRIA/UIUC/Argonne Workshop (11/22/2013)
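
Argobots later became a public library; the following is a minimal sketch using its current public API, which postdates this 2013 talk and may differ from the prototype described here (pool and scheduler details simplified):

#include <abt.h>
#include <stdio.h>

#define NWORK 8

static void work_unit(void *arg)
{
    printf("work unit %d\n", (int)(size_t)arg);   /* a user-level thread */
}

int main(int argc, char **argv)
{
    ABT_xstream xstream;          /* execution stream: an OS-level thread */
    ABT_pool    pool;
    ABT_thread  threads[NWORK];   /* work units: user-level threads       */

    ABT_init(argc, argv);

    /* Create a secondary execution stream and get its default pool. */
    ABT_xstream_create(ABT_SCHED_NULL, &xstream);
    ABT_xstream_get_main_pools(xstream, 1, &pool);

    /* Push work units into the pool; they are scheduled on that stream and
     * switch among themselves without OS involvement. */
    for (size_t i = 0; i < NWORK; i++)
        ABT_thread_create(pool, work_unit, (void *)i,
                          ABT_THREAD_ATTR_NULL, &threads[i]);

    for (size_t i = 0; i < NWORK; i++)
        ABT_thread_free(&threads[i]);         /* waits for completion */

    ABT_xstream_join(xstream);
    ABT_xstream_free(&xstream);
    ABT_finalize();
    return 0;
}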

36 Pavan Balaji, Argonne National Laboratory Take Away
 Massive compute parallelism is inevitable
 MPI requires a lot of study to "play well" with these systems
 Several such optimizations can be done within the MPI implementation
 Some require changes to the MPI standard
 Need more understanding of how things work before we try to standardize it
INRIA/UIUC/Argonne Workshop (11/22/2013)

37 Thank You!

